Site Reliability Engineer

Job Description

Overview

Our client is seeking a Site Reliability Engineer (SRE) to design and build production configuration and deployment tools for its high-frequency trading (HFT) platform. The role is critical to the stability, scalability, and automation of the infrastructure. The ideal candidate will have extensive experience building complex, production-focused tools, with an emphasis on reliability and performance.

Key Responsibilities

  • Develop and maintain scalable production tools to automate deployment, monitoring, and infrastructure management.
  • Improve system reliability, performance, and efficiency through automation and tooling.
  • Work closely with trading and development teams to ensure seamless operation of live trading systems.
  • Manage configuration and deployment processes across AWS-based infrastructure.
  • Implement observability tools to enhance system monitoring and debugging capabilities.
  • Ensure fault tolerance, redundancy, and high availability for critical trading systems.
  • Support and enhance infrastructure for both C++- and Rust-based trading systems, ensuring seamless integration.


Required Qualifications

  • Strong programming skills in Python, with the ability to read and understand C/C++ code.
  • Deep understanding of Linux systems.
  • Experience with deployment and configuration management in AWS and/or on-premises clusters.
  • Proficiency in monitoring, logging, and alerting solutions to maintain high system uptime.
  • Strong background in networking fundamentals, including TCP/IP and system performance tuning.
  • Experience with scripting (e.g. Python, Bash) for automation.


Skills

  • Familiarity with IaC tools, such as Terraform or Ansible, for infrastructure automation.
  • Experience in low-latency or high-performance environments is a plus but not required.
  • Strong problem-solving skills and the ability to work in a highly collaborative team.


Soft Skills & Culture Fit

  • Candidates from top-tier institutions or recognized as domain experts are preferred.
  • Strong analytical skills and ability to work in high-pressure, real-time environments.
  • Collaborative team player who enjoys solving complex engineering problems.



Whilst we carefully review all applications for all jobs, the high volume we receive means we are unable to respond to unsuccessful applicants.

Contact
If this sounds like you, or you'd like more information, please get in touch:

George Hutchinson-Binks
george.hutchinson-binks@oxfordknight.co.uk
(+44) 07885 545220
linkedin.com/in/george-hutchinson-binks-a62a69252

London, UK
Full time

Published on 07/03/2025
