Lead Site Reliability Engineer Job Description Template
Our company is looking for a Lead Site Reliability Engineer to join our team.
Responsibilities:
- Represent Couchbase in customer meetings and serve as a customer advocate in influencing product roadmap and improvements;
- Own the end-to-end availability (SLO/SLA), reliability, and performance of Couchbase’s Cloud offerings;
- Participate in 24×7 Site Reliability rotations and escalation workflows;
- Take ownership of many controls, processes, and risks required to maintain our compliance portfolio (SOC 2, PCI-DSS, GDPR, and HIPAA, among others);
- Develop automation, processes and metrics to ensure maximum reliability and uptime for our customers;
- Present quarterly operations reviews in addition to other, more routine reporting obligations;
- Establish an on-call cadence with the team and ensure adequate coverage;
- Foster a healthy and collaborative culture, in line with Couchbase’s core values;
- Serve as project manager or scrum master for major initiatives and train the team to be the first line of support;
- Serve as a change board approver and incident manager.
Requirements:
- A passion for SRE/DevOps and running highly resilient/automated systems;
- Experience managing on-call rotations across continents using a follow-the-sun model and handling incident response to ensure high availability;
- BS/BE/Master’s in Computer Science;
- Experience regularly reporting on availability and incidents to senior management;
- At least 7 years of work experience in Site Reliability/Infrastructure Engineering on a team operating in the public cloud;
- Experience developing or integrating Chaos Engineering tool chains or methodologies;
- Ability to build a team culture that aims for high service availability, scalability, and observability;
- Bias towards data-driven decisions and ensuring key metrics are agreed on, visible, and actionable;
- Deep working experience with cloud platforms like Amazon Web Services, open source software like Kubernetes and Prometheus, and monitoring services like Datadog.