Site Reliability Engineer Job Description Template
Our company is looking for a Site Reliability Engineer to join our team.
Responsibilities:
- Maintain and prepare reports on operational activities, and perform backups so that data can be retrieved in emergencies;
- Perform appropriate tests and provide training to improve product quality and standardize all artifacts;
- Work with Security Managers to establish and document security controls and procedures;
- Administer all aspects of OC physical planning, and provide security and backups for system recovery;
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health;
- Build and evolve the operations handbook;
- Prepare designs and evaluate all balancing functions as required by Engineering departments and other functional areas;
- Promote and apply best practices for building scalable and reliable services across engineering;
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity;
- Develop and improve tools to automate the monitoring and resolution of production issues;
- Assist technical staff in verifying that all issues are resolved so that objectives are met;
- Use and improve existing tools for effective administration and monitoring of a large-scale web service on AWS;
- Prepare and review all Service Level and Operational Metrics, and KPI scorecards for service delivery;
- Troubleshoot and resolve live production issues by analyzing logs/errors from different sources;
- Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
Requirements:
- Strong analytical, problem-solving, and organizational skills;
- Excellent written and oral communication skills;
- Leadership skills: sound problem resolution, judgment, negotiation, and decision-making;
- Experience with Azure DevOps (ADO);
- Knowledge of feedback and metrics collection techniques for exposing live site and service issues;
- Expertise in problem solving and in analyzing global-scale distributed systems and critical production service environments;
- Mastery of CI/CD concepts and hands-on implementation experience, specifically with Git;
- A passion for building and participating in highly effective teams and development processes;
- Strong communication and interpersonal skills;
- Experience with Continuous Delivery practices;
- Degree in computer science, equivalent to a BTech from top institutes such as IITs/IIITs/BITS;
- Good knowledge of OLTP database design concepts and RDBMS internals (MySQL is preferred);
- Ability to balance multiple tasks and projects effectively and quickly adapt to new variables;
- Ability to debug and optimize code and automate routine tasks;
- Proven track record of educational excellence;
- Experience automating and running large-scale production Java/Tomcat services on AWS (EC2, ECS, KMS, Kinesis, RDS) or other cloud providers.