Lead Site Reliability Engineer

Lead Site Reliability Engineer Job Description Template

Our company is looking for a Lead Site Reliability Engineer to join our team.

Responsibilities:

  • Represent Couchbase in customer meetings and serve as a customer advocate in influencing the product roadmap and improvements;
  • Own the end-to-end availability (SLO/SLA), reliability, and performance of Couchbase’s Cloud offerings;
  • Participate in 24×7 Site Reliability rotations and escalation workflows;
  • Take ownership of the controls, processes, and risks required to maintain our compliance portfolio (SOC 2, PCI-DSS, GDPR, and HIPAA, among others);
  • Develop automation, processes and metrics to ensure maximum reliability and uptime for our customers;
  • Present quarterly operations reviews in addition to more routine reporting obligations;
  • Establish an on-call cadence with the team and ensure adequate coverage;
  • Foster a healthy and collaborative culture, in line with Couchbase’s core values;
  • Serve as project manager or scrum master for major initiatives and train the team to be the first line of support;
  • Serve as a change board approver and incident manager.

Requirements:

  • A passion for SRE/DevOps and running highly resilient/automated systems;
  • Experience managing on-call rotations across continents using a follow-the-sun model, and handling incident response to ensure high availability;
  • BS, BE, or Master’s degree in Computer Science;
  • Experience reporting regularly on availability and incidents to senior management;
  • At least 7 years of work experience in Site Reliability/Infrastructure Engineering for a team operating in public cloud;
  • Experience developing or integrating Chaos Engineering tool chains or methodologies;
  • Ability to build a team culture that aims for high service availability, scalability, and observability;
  • Bias towards data-driven decisions and ensuring key metrics are agreed on, visible, and actionable;
  • Deep working experience with cloud platforms like Amazon Web Services, open source software like Kubernetes and Prometheus, and monitoring tools like Datadog.