Senior Site Reliability Engineer (SRE)

EF Education First

This job is no longer accepting applications

Software Engineering

Shanghai, China · China

Posted 6+ months ago

Role

Own reliability for our AWS-based platform (EKS, RDS, MSK, Lambda, and other serverless services).

Lead incident response, SLO/error-budget practice, and automation that reduces toil and cost.

Partner with product/engineering to ship fast and safely.

What you’ll do

Incidents: Lead major incidents, comms, and blameless postmortems; codify auto-remediation playbooks.
Edge & networking: Operate Cloudflare/Ali ESA (CDN, WAF, Rate Limiting, Bot Mgmt). Design caching, cache-keys, purge flows, origin shielding, and failover.
Automation & IaC: Build/maintain Terraform/CDK modules and CI/CD guardrails; enable zero-touch rollouts/rollbacks.
Observability: Own metrics, logs, traces, and alert quality (high signal, low noise) with Datadog/Prometheus/Grafana/ELK.
Performance & cost: Profile and tune latency/throughput; right-size clusters; optimize spend (compute, storage, data egress).
Security by default: Least-privilege IAM, secrets management, and runtime policies; keep prod compliant.
Technical leadership: Mentor engineers, review designs, influence architecture, and evolve the SRE roadmap.

Minimum qualifications

Experience: 2+ years in SRE/Platform/Infra (or equivalent impact), including leading incidents and on-call.
Systems: Strong Linux, containers/Kubernetes; deep AWS (EKS, VPC, IAM).
Code: Proficient in Go, Node.js, or Python; you’ve built automation/services that removed real toil.
Ops craft: Capacity planning, chaos/testing, postmortems, and CI/CD for microservices.

Nice to have

AWS certifications (SAA/SAP), Karpenter/Cluster Autoscaler, Argo CD/GitOps, Kafka ops.
DR/failover design, multi-region patterns.
