Lead Site Reliability Engineer Job Description

Lead Site Reliability Engineer Job Description Template

Our company is looking for a Lead Site Reliability Engineer to join our team.

Represent Couchbase in customer meetings and serve as a customer advocate in influencing product roadmap and improvements;
Own the end-to-end availability (SLO/SLA), reliability, and performance of Couchbase’s Cloud offerings;
Participate in 24×7 Site Reliability rotations and escalation workflows;
Take ownership of many controls, processes, and risks required to maintain our compliance portfolio (SOC 2, PCI-DSS, GDPR, and HIPAA, among others);
Develop automation, processes and metrics to ensure maximum reliability and uptime for our customers;
Present quarterly operations reviews in addition to more routine reporting obligations;
Establish an on-call cadence with the team and ensure adequate coverage;
Foster a healthy and collaborative culture, in line with Couchbase’s core values;
Serve as project manager or scrum master for major initiatives and train the team to be the first line of support;
Serve as a change board approver and incident manager.

A passion for SRE/DevOps and running highly resilient/automated systems;
Manage on-call rotations across continents using a follow-the-sun model, and handle incident response to ensure high availability;
BS/BE/Master’s in Computer Science;
Regularly report on availability and incidents to senior management;
At least 7 years of work experience in Site Reliability/Infrastructure Engineering on a team operating in the public cloud;
Experience developing or integrating Chaos Engineering tool chains or methodologies;
Build a team culture that aims for high service availability, scalability, and observability;
A bias toward data-driven decisions, ensuring key metrics are agreed on, visible, and actionable;
Deep working experience with cloud platforms such as Amazon Web Services, open source software such as Kubernetes and Prometheus, and monitoring tools such as Datadog.