Site Reliability Engineer
Job Description: Senior Site Reliability Engineer
- Must be able to travel onsite periodically (Oak Ridge, TN)
- Must be eligible for a Federal Security Clearance (US)
Major Duties/Responsibilities:
- Lead ongoing improvements in reliability and scalability for our Kubernetes- and Linux-based applications and services.
- Contribute as a senior technical resource to define and implement best practices and standards for the center.
- Provide primary operational support and engineering for production applications.
- Define and implement KPIs and processes, and drive continuous improvement.
- Influence the architecture and implementation of solutions.
- Tune operating systems and applications to increase performance and reliability of services.
- Mentor junior staff and enable them for success.
- Diagnose system operational problems quickly and effectively.
- Participate in the on-call rotation, providing 24-hour, 7-day support and covering off-hours maintenance windows.
- Coordinate with vendors to resolve hardware and software problems.
- Deliver the mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote diversity, equity, inclusion, and accessibility by fostering a respectful workplace – in how we treat one another, work together, and measure success.
Basic Qualifications:
- Bachelor’s degree in computer science or a closely related field and a minimum of 8 years of experience as an SRE/Systems Engineer. An equivalent combination of education and experience may be considered.
Qualifications:
- Excellent interpersonal/communication skills, and the ability to work as part of a team.
- Strong working knowledge of Unix system fundamentals and common network protocols.
- Experience managing Linux/UNIX operating systems in a heterogeneous environment.
- Solid understanding of networked computing environment concepts.
- Ability to develop and maintain programs and scripts that aid in operations and automation, using shell (primarily bash) and high-level (Python or Go) languages.
- Ability to proactively identify performance issues, problems, and areas for improvement.
- Ability to identify requirements and to define, plan, and implement requisite solutions.
- Ability to plan, organize, prioritize tasks, and complete assigned projects with minimal supervision.
- Experience with continuous integration and continuous deployment software methodologies and how they apply to SRE/systems engineering.
- An understanding of code review and familiarity with tools like GitHub and GitLab.
- Experience using tools such as Nagios, Grafana, and Prometheus to monitor systems and metrics and to create dashboards.
- Experience designing and implementing highly available systems/services utilizing virtual machines and Kubernetes resources.
- Experience participating in an open-source community with patches accepted upstream.
- Experience deploying and maintaining automated configuration management software such as Puppet or Ansible.
- Experience implementing systems-level security technologies like SELinux and following security best practices.