Job Description

Principal Observability & Cloud Platform Engineer

Most observability engineers run someone else's stack. This role is for the person who builds it.

Our client is re-architecting observability and cloud infrastructure at a scale very few engineers ever touch: a ~3,000-node Kubernetes estate, 50TB of logs a day (around 600k logs/second) and up to 80 million active time-series, running multi-region and multi-cloud across AWS and GCP.

You'll own the architecture: metrics, logs, traces, telemetry pipelines, service mesh and developer experience for thousands of services and millions of devices. You'll overhaul core open-source components, storage layers and query paths for performance, cost and reliability, and push improvements back upstream to CNCF projects.

This is hands-on architecture, not stack-sitting.

What you'll need:

- Strong, hands-on Go in production, plus Python or Shell.
- Real scale: PB-level ingestion and hundreds of millions of active series, and you built or scaled it, not just watched it run.
- Depth across the open-source observability stack: Prometheus, Grafana, and large-scale metrics (Thanos, Mimir, Cortex or VictoriaMetrics); logs (Loki / ELK / OpenSearch); traces (Tempo).
- Kubernetes at multi-cluster scale, service mesh (Istio / Envoy), Terraform, and AWS and/or GCP.
- A track record of evolving storage and query architectures (TSDB, Parquet, distributed processing) for cost, scale and latency.

Nice to have:

- OpenTelemetry / OpenMetrics standards work, CNCF open-source contributions, security-in-platform experience, and using AI tooling to cut toil.

Principal Observability & Cloud Platform Engineer in Cambridge
