Site Reliability & Observability Engineer (Remote)

360training

United States · Remote

Posted 6+ months ago
Position: Site Reliability & Observability Engineer (Remote)
Location: USA, Remote
Job Id: 456
# of Openings: 1

Why 360training?
At 360training, we’re more than just a leader in online training—we’re helping people unlock their potential and shape their futures. For over two decades, we’ve empowered millions of learners with regulatory-approved training across industries, making it possible for individuals to get the jobs they want and keep the careers they love.
Our success is built on two simple but powerful values: Deliver Results and Do the Right Thing. They’re not just words on a wall; they guide how we work, collaborate, and grow together. At 360training, you’ll join a passionate team that invests in your development, rewards your results, and supports you personally and professionally.
If you’re looking for a career where you can make an impact, grow quickly, and be valued every step of the way—this is your chance.
Site Reliability & Observability Engineer
360training is seeking a Site Reliability & Observability Engineer to build and scale our observability and reliability practices across cloud, container, and application environments. This role will be responsible for developing the systems, tools, and processes that ensure application performance, reliability, and visibility across multiple platforms and brands.
The SRE will partner closely with DevOps and Development teams to define service-level objectives (SLOs), establish automated monitoring and alerting, and drive performance optimization across infrastructure and applications. This individual will also play a critical role in incident response, postmortems, and the ongoing evolution toward a data-driven reliability culture.
Our ideal candidate is a hands-on engineer with experience in application performance monitoring (APM), metrics, tracing, and logging, and a strong background in automation and cloud-native observability tooling.
Key Responsibilities:
Observability Platform Development
  • Design, implement, and manage the enterprise-wide observability stack (APM, metrics, logs, and traces) across Azure and containerized workloads.
  • Deploy and maintain monitoring tools to ensure full-stack visibility.
  • Build standardized dashboards, alerts, and KPIs for key services and business applications.
  • Develop and maintain automation for telemetry data collection, alert configuration, and dashboard provisioning.
  • Ensure coverage for application, infrastructure, and end-user experience monitoring across all environments.
Reliability Engineering
  • Define and maintain Service-Level Objectives (SLOs), Service-Level Indicators (SLIs), and Error Budgets in partnership with DevOps and Development teams.
  • Implement automated incident detection, alerting, and response playbooks to reduce MTTR.
  • Analyze recurring incidents and drive permanent fixes and reliability improvements.
  • Support the transition toward zero-downtime deployments by validating performance and stability during rollout stages.
Performance & Cost Optimization
  • Establish performance baselines and track resource utilization across cloud and container infrastructure.
  • Work with DevOps and Development teams to identify performance bottlenecks and recommend optimizations.
  • Monitor and optimize metrics ingestion, Azure Log Analytics usage, and storage costs to balance visibility with efficiency.
Incident Management & Postmortems
  • Serve as a key responder during major incidents, providing data-driven insights and remediation coordination.
  • Lead root cause analysis (RCA) and ensure postmortem action items are implemented.
  • Build dashboards and analytics to identify leading indicators of failure and performance degradation.
  • Improve operational playbooks to accelerate detection and recovery.
Automation & Continuous Improvement
  • Contribute to CI/CD pipeline integrations for instrumentation validation and canary monitoring.
  • Continuously evaluate emerging observability tools and practices for adoption.
  • Advocate for reliability and monitoring best practices across engineering teams.
Required Skills:
  • 5+ years of experience in Site Reliability, Observability, or DevOps Engineering roles.
  • Strong hands-on experience with observability tools such as Datadog, New Relic, Grafana, ELK/EFK, or equivalent.
  • Deep understanding of metrics, tracing, and logging concepts and their correlation across distributed systems.
  • Experience implementing synthetic monitoring and real user monitoring (RUM) for frontend performance.
  • Experience defining and managing SLOs, SLIs, and Error Budgets.
  • Solid grasp of Azure infrastructure, Kubernetes (AKS), and container monitoring.
  • Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows.
  • Excellent analytical and communication skills; able to translate complex data into actionable insights.
Preferred Skills:
  • Understanding of distributed tracing in microservice architectures.
  • Experience with frontend website performance tuning and optimization based on Core Web Vitals.
  • Strong scripting and automation skills (Python, PowerShell, or Bash).
  • Experience with incident management and RCA processes.
