Senior Site Reliability Engineer (SRE)

EF Education First

Software Engineering
Shanghai, China
Posted on Nov 3, 2025

Role

Own reliability for our AWS-based platform (EKS, RDS, MSK, Lambda, Serverless, etc.).

Lead incident response, SLO/error-budget practice, and automation that reduces toil and cost.

Partner with product/engineering to ship fast and safely.

What you’ll do

  • Incidents: Lead major incidents, comms, and blameless postmortems; codify auto-remediation playbooks.
  • Edge & networking: Operate Cloudflare/Ali ESA (CDN, WAF, Rate Limiting, Bot Mgmt). Design caching, cache keys, purge flows, origin shielding, and failover.
  • Automation & IaC: Build and maintain Terraform/CDK modules and CI/CD guardrails; deliver zero-touch rollouts and rollbacks.
  • Observability: Own metrics, logs, traces, and alert quality (high signal, low noise) with Datadog/Prometheus/Grafana/ELK.
  • Performance & cost: Profile and tune latency/throughput; right-size clusters; optimize spend (compute, storage, data egress).
  • Security by default: Enforce least-privilege IAM, secrets management, and runtime policies; keep production compliant.
  • Technical leadership: Mentor engineers, review designs, influence architecture, and evolve the SRE roadmap.

Minimum qualifications

  • Experience: 2+ years in SRE/Platform/Infra (or equivalent impact), including leading incidents and on-call.
  • Systems: Strong Linux, containers/Kubernetes; deep AWS (EKS, VPC, IAM).
  • Code: Proficient in Go, Node.js, or Python; you’ve built automation/services that removed real toil.
  • Ops craft: Capacity planning, chaos/testing, postmortems, and CI/CD for microservices.

Nice to have

  • AWS certifications (SAA/SAP), Karpenter/Cluster Autoscaler, Argo CD/GitOps, Kafka ops.
  • DR/failover design, multi-region patterns.