Site Reliability Engineer
Job Description
Tax Analysts is seeking a Site Reliability Engineer (SRE) to help establish and shape our reliability engineering practice from the ground up. This is a unique opportunity to join a mission-driven organization and play a key role in ensuring the reliability, scalability, and performance of our AWS-hosted business applications.
As part of a cross-functional engineering team, you will work to improve observability, automate operational processes, and lead incident response and continuous improvement efforts. This role is ideal for a mid-level engineer with cloud and software engineering experience who is eager to deepen their expertise in site reliability engineering, learn from senior staff, and help build a culture of reliability.
ESSENTIAL DUTIES AND RESPONSIBILITIES:
- Help define and implement service-level indicators (SLIs) and service-level objectives (SLOs) for cloud-based applications.
- Build, configure, and maintain monitoring, alerting, and dashboarding solutions using AWS CloudWatch, X-Ray, and third-party tools such as DataDome.
- Leverage advanced AWS observability tools (e.g., CloudWatch Synthetics, Contributor Insights) to proactively monitor system health.
- Contribute to the development and implementation of a structured on-call support process as our reliability practice evolves.
- Implement, monitor, and maintain site protection and bot mitigation solutions, including DataDome, to defend against automated attacks and ensure application availability, and analyze their performance during incident postmortems.
- Investigate and resolve incidents, security events, and operational anomalies; perform root cause analysis; and run the postmortem process.
- Identify repetitive or manual operational tasks (‘toil’) and design scripts or automations using AWS Lambda and CloudFormation to improve efficiency and reliability.
- Assist in the maintenance and enhancement of CI/CD pipelines and automated deployment processes.
- Work closely with development, QA, cloud, and DevOps teams to ensure reliability, scalability, and security are integrated into system and application designs.
- Contribute to the documentation of systems, processes, incident learnings, compliance, and reliability best practices.
- Stay current with emerging AWS, SRE, and observability technologies, and make recommendations to adopt new tools or approaches that improve system resilience and operational excellence.
- Participate in the evaluation and rollout of new AWS services and features that can benefit system reliability or team efficiency.
- Perform other related duties as assigned to support the team and organizational objectives.
KNOWLEDGE & SKILLS:
- Strong analytical, troubleshooting, and problem-solving abilities.
- Hands-on experience with AWS CloudWatch (metrics, logs, dashboards, alarms) for proactive monitoring and alerting.
- Familiarity with AWS X-Ray for distributed tracing and in-depth troubleshooting of microservices architectures.
- Experience leveraging tools like CloudWatch Synthetics and Contributor Insights for canary testing and log analytics.
- Knowledge of AWS CloudTrail for auditing and investigating API calls and security events.
- Experience using AWS Athena for ad-hoc querying and analysis of logs during incident investigations and postmortems.
- Proficiency with AWS CloudFormation for reliable and repeatable infrastructure provisioning.
- Experience automating operational tasks and workflows using AWS Lambda or similar event-driven services.
- Understanding of AWS services such as API Gateway, CloudFront, and Elastic Load Balancing (ELB) to ensure availability, scalability, and optimal performance of distributed systems.
- Experience working with site protection and bot mitigation solutions (such as DataDome or Cloudflare).
- Working knowledge of scripting or programming tools such as Python, Bash, or Node.js for automation and tooling.
- Excellent communication and documentation skills; ability to collaborate effectively with cross-functional teams.
- Eagerness to learn and adopt new tools, technologies, and best practices in cloud reliability and operations.
Requirements
- Bachelor’s degree in computer science, engineering, or a related field; equivalent professional experience considered.
- 3+ years of professional experience in cloud engineering, DevOps, infrastructure, or observability roles (AWS required).
- Experience implementing SRE principles (prior work in an SRE role is a plus).
- Experience with monitoring, incident response, or reliability work in a production environment.
- Experience working in an Agile development environment, collaborating within cross-functional teams.
- Eagerness to help establish and improve site reliability practices while learning and applying best practices.
Benefits
- Health/Dental/Vision
- 401(k): immediately vested
- Tuition assistance
- Qualified employer under the Public Service Loan Forgiveness (PSLF) program
- Generous Paid Time Off
- Dog-friendly office
- Private gym onsite
- Medical, Dental, Vision Insurance
- Health Savings Account (HSA)
- Flexible Spending Account (FSA)
- Employee Assistance Program (EAP)
- Life and AD&D Insurance
- Insurance
- Pet Insurance
- Trade Publication/News Subscription Reimbursement
- Exercise Room
- Paid Holidays
- Vacation and Sick Leave
- Parental Leave
Tax Analysts is an Equal Employment Opportunity Employer.