Site Reliability Engineer in Houston

Energy Jobline is the largest and fastest-growing global energy job board and energy hub. We reach an audience of over 7 million energy professionals, advertise 400,000+ global energy and engineering jobs each month, and work with the leading energy companies worldwide.

We focus on the Oil & Gas, Renewables, Engineering, Power, and Nuclear markets as well as emerging technologies in EV, Battery, and Fusion. We are committed to ensuring that we offer the most exciting career opportunities from around the world for our jobseekers.

Job Description

About nClouds:


nClouds is a credentialed, award-winning provider of DevOps and cloud professional services, products, and solutions, specializing in modern infrastructures on AWS. We work as an extension of our clients and love tackling their stickiest challenges. All so our clients can deliver innovation faster and create awesome customer experiences.


Job Summary:

The SRE team is responsible for availability, reliability, performance, monitoring, change management, and emergency response for infrastructure and applications, and for reducing manual work by implementing SRE principles and practices. The team works directly with development, DevOps, operations, product, and other teams to deploy new features and changes and to maintain infrastructure, operations, CI/CD, and IaC, so that availability and reliability targets are met and SLOs and SLAs are protected. We use a variety of DevOps automation tools such as Ansible, Docker, Kubernetes, Terraform, and Jenkins, along with AWS-specific services such as ECS, CloudFormation, EKS, OpsWorks, and Elastic Beanstalk. The SRE engineer is capable of implementing observability, SLOs, SLIs, SLAs, and disaster recovery and backup plans in cloud environments, mainly AWS.


Key Responsibilities:

  • Ensure the availability and reliability of distributed systems.
  • Help the L1 team resolve clients' infrastructure and system issues, escalations, alerts, tickets, and queries.
  • Work as a bridge between DevOps and other teams to build and maintain resilient systems.
  • Conduct, coordinate, and oversee post-incident root cause analyses and reviews.
  • Build and maintain documentation for all assigned clients / projects.
  • Leverage DevOps, Agile methodology, and standards in day-to-day work.
  • Adopt and propose automation of repetitive tasks to reduce/eliminate toil.
  • Implement and troubleshoot using observability tools such as Datadog, New Relic, Splunk, and CloudWatch.
  • Adopt SRE practices and ensure the team follows them.
  • Maintain AWS managed resources, CI/CD pipelines, and IaC.
  • Plan and implement disaster recovery and backup plans for AWS cloud platforms.
  • Proactively work on efficiency and capacity planning.
  • Take a proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
  • Liaise and work closely with L1 on-call support, DevOps, and operations teams.
  • Drive availability and reliability by defining and implementing SLIs, SLOs, error budgets, observability, disaster recovery, and backups to detect and mitigate issues.



Qualifications:

  • Bachelor's degree in computer science or an equivalent management, technical, or scientific discipline.
  • Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript.
  • Clear understanding of SRE principles and practices and Agile and DevOps methodologies.
  • Experience with the AWS Well-Architected Framework for implementing scalable and reliable infrastructure.
  • Great team player with flexibility to work.
  • Excellent written/verbal communication and leadership skills.

If you are interested in applying for this job, please press the Apply button and follow the application process. Energy Jobline wishes you the very best of luck in your next career move.

Houston, TX
Full time

Published on 11/05/2025
