Site Reliability Engineer

Site Reliability Engineer Job Description Template

Our company is looking for a Site Reliability Engineer to join our team.

Responsibilities:

  • Prepare and maintain reports on operational activities, and maintain backups so that data can be recovered in emergencies;
  • Run appropriate tests and provide training to improve product quality and standardize artifacts;
  • Work with Security Managers to establish and document security controls and procedures;
  • Administer all aspects of OC physical planning, and provide security and backups for system recovery;
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health;
  • Build and evolve the operations handbook;
  • Prepare designs and evaluate all balancing functions as required by Engineering departments and other functional areas;
  • Promote and apply best practices for building scalable and reliable services across engineering;
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity;
  • Develop and improve tools to automate the monitoring and resolution of production issues;
  • Assist technical staff in verifying that issues are resolved and objectives are met;
  • Use and improve existing tools to administer and monitor a large-scale web service on AWS;
  • Prepare and review service-level and operational metrics and KPI scorecards for service delivery;
  • Troubleshoot and resolve live production issues by analyzing logs/errors from different sources;
  • Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.

Requirements:

  • Strong analytical, problem-solving, and organizational skills;
  • Excellent written and oral communication skills;
  • Leadership skills: sound judgment, problem resolution, negotiation, and decision-making;
  • Experience with Azure DevOps (ADO);
  • Knowledge of feedback and metrics collection techniques for exposing live-site and service issues;
  • Expertise in analyzing and troubleshooting global-scale distributed systems and critical production service environments;
  • Mastery of CI/CD concepts and hands-on implementation experience, specifically with Git;
  • A passion for building and participating in highly effective teams and development processes;
  • Strong communication and interpersonal skills;
  • Experience with Continuous Delivery practices;
  • Degree in computer science (BTech or equivalent) from a top institute such as the IITs, IIITs, or BITS;
  • Good knowledge of OLTP database design concepts and RDBMS internals (MySQL preferred);
  • Ability to balance multiple tasks and projects effectively and quickly adapt to new variables;
  • Ability to debug and optimize code and automate routine tasks;
  • Proven track record of academic excellence;
  • Experience automating and running large-scale production Java/Tomcat services on AWS (EC2, ECS, KMS, Kinesis, RDS) or other cloud providers.