Senior Site Reliability Engineer (SRE)
EF Education First
Software Engineering
Shanghai, China
Posted on Nov 3, 2025
Role
Own reliability for our AWS-based platform (EKS, RDS, MSK, Lambda, Serverless, etc.).
Lead incident response, SLO/error-budget practice, and automation that reduces toil and cost.
Partner with product/engineering to ship fast and safely.
What you’ll do
- Incidents: Lead major incidents, comms, and blameless postmortems; codify auto-remediation playbooks.
- Edge & networking: Operate Cloudflare/Ali ESA (CDN, WAF, Rate Limiting, Bot Mgmt). Design caching, cache keys, purge flows, origin shielding, and failover.
- Automation & IaC: Build and maintain Terraform/CDK modules and CI/CD guardrails; enable zero-touch rollouts and rollbacks.
- Observability: Own metrics, logs, traces, and alert quality (high signal, low noise) with Datadog/Prometheus/Grafana/ELK.
- Performance & cost: Profile and tune latency/throughput; right-size clusters; optimize spend (compute, storage, data egress).
- Security by default: Enforce least-privilege IAM, secrets management, and runtime policies; keep prod compliant.
- Technical leadership: Mentor engineers, review designs, influence architecture, and evolve the SRE roadmap.
Minimum qualifications
- Experience: 2+ years in SRE/Platform/Infra (or equivalent impact), including leading incidents and on-call.
- Systems: Strong Linux, containers/Kubernetes; deep AWS (EKS, VPC, IAM).
- Code: Proficient in Go, Node.js, or Python; you’ve built automation/services that removed real toil.
- Ops craft: Capacity planning, chaos/testing, postmortems, and CI/CD for microservices.
Nice to have
- AWS certifications (SAA/SAP), Karpenter/Cluster Autoscaler, Argo CD/GitOps, Kafka ops.
- DR/failover design, multi-region patterns.