Site Reliability Engineer

Job Description

Senior Site Reliability Engineer

  • Must be able to travel onsite periodically (Oak Ridge, TN)
  • Must be eligible for a Federal Security Clearance (US )

Major Duties/Responsibilities:

  • Lead ongoing improvements in reliability and scalability for our Kubernetes and Linux-based applications and services.
  • Contribute as a senior technical resource to define and implement best practices and standards for the center.
  • Provide primary operational support and engineering for production applications.
  • Define and implement KPIs and processes, and drive continuous improvement.
  • Influence the architecture and implementation of solutions.
  • Tune operating systems and applications to increase performance and reliability of services.
  • Mentor junior staff and enable them for success.
  • Diagnose system operational problems quickly and effectively.
  • Participate in on-call rotation providing 24-hour, 7-day support and off-hours maintenance windows.
  • Coordinate with vendors to resolve hardware and software problems.
  • Deliver the mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote diversity, equity, and accessibility by fostering a respectful workplace – in how we treat one another, work together, and measure success.

Basic Qualifications:

Bachelor’s degree in computer science or a closely related field and a minimum of 8 years of experience as an SRE/Systems Engineer. An equivalent combination of education and experience may be considered.

Qualifications:

  • Excellent interpersonal/communication skills, and the ability to work as part of a team.
  • Strong working knowledge of Unix system fundamentals and common network protocols.
  • Experience managing Linux/UNIX operating systems in a heterogeneous environment.
  • Solid understanding of networked computing environment concepts.
  • Ability to develop and maintain programs and scripts that aid in operations and automation, using shell (primarily bash) and high-level languages (Python or Go).
  • Ability to proactively identify performance issues, problems, and areas for improvement.
  • Ability to identify requirements and to define, plan, and implement requisite solutions.
  • Ability to plan, organize, prioritize tasks, and complete assigned projects with minimal supervision.
  • Experience with continuous integration and continuous deployment software methodologies and how they apply to SRE/systems engineering.
  • An understanding of code review and familiarity with tools like GitHub and GitLab.
  • Experience using tools such as Nagios, Grafana, and Prometheus to monitor systems and metrics and to create dashboards.
  • Experience designing and implementing highly available systems/services utilizing virtual machines and Kubernetes resources.
  • Experience participating in an open source community with patches accepted upstream.
  • Experience deploying and maintaining automated configuration management software such as Puppet or Ansible.
  • Experience implementing systems-level security technologies like SELinux and following security best practices.


Oak Ridge, TN
Full time

Published on 07/15/2025
