The term SRE, or Site Reliability Engineering, has been around for over a decade. SRE aims to create ultra-scalable and highly reliable software systems by applying principles from software engineering to infrastructure and operations problems. There is still much room for wider adoption of SRE across industry verticals, and for change in the way SRE is practiced today.
The SRE Report 2019, published by Catchpoint, highlights some of the focus areas for improvement. This article discusses the key findings of the report, with a particular focus on avoiding burnout among Site Reliability Engineers.
SRE is still an emerging practice
64% of the respondents to the Catchpoint survey said that the SRE role in their company has existed for less than 3 years. The meaning of the word "reliable" can depend on the context, and organizations are still defining their reliability targets and the processes to reach them. Over 57% of the respondents reported that their SRE team has fewer than 10 members. Concentrating the work on so few individuals can lead to excessive toil and burnout.
Insufficient Automation: In the same survey, 59% of the respondents said that there is too much toil in their jobs. "Toil" refers to manual, repetitive, automatable work. Almost half the respondents believe that insufficient automation of maintenance tasks is one of the main sources of toil.
Lack of Clear Objectives: About 48% of the respondents said that Service Level Objectives remain undefined for essential services in their organization. A lack of clarity on objectives can lead to a drop in key metrics, reduced customer satisfaction, lost revenue, lower employee productivity (when internal apps face downtime) and other negative repercussions.
Given the nascency of site reliability practices, organizations adopting SRE must be clear on their Service Level Objectives (SLOs) from the very beginning, and review these objectives periodically. Tasks that can be automated should be. Organizations and leads/managers must be careful in picking monitoring tools. Zenduty connects with all your monitoring tools and records incidents in one place so your team can track them better.
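To make the idea of a clearly defined SLO concrete, here is a minimal sketch of checking observed availability against a target and tracking the remaining error budget. The function name, the request counts and the 99.9% target are illustrative assumptions, not figures from the report or part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float    # fraction of requests that succeeded
    error_budget: int      # failed requests the SLO tolerates in this period
    budget_remaining: int  # budget left; a negative value means the SLO is breached


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare observed availability against a target such as 0.999 (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    availability = (total_requests - failed_requests) / total_requests
    # The error budget is the share of requests allowed to fail under the SLO.
    error_budget = round(total_requests * (1 - slo_target))
    return SloReport(availability, error_budget, error_budget - failed_requests)


# Assumed example numbers: one million requests, 400 failures, 99.9% target.
report = evaluate_slo(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(report)
```

Reviewing the remaining budget at each periodic SLO review gives the team a shared, numeric answer to whether a service is "reliable enough" for its context.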
One of the main responsibilities of SREs is Incident Resolution
Fixing an unplanned interruption to a service is one of the key responsibilities of SREs. Ideally, monitoring and alerting tools should be used to flag, track and analyze incidents after the fact. Some SRE teams have not implemented this automation. Almost 50% of the respondents reported having worked on outages lasting more than a day at some point.
When there are too few people working on incident resolution, each of them experiences a great degree of stress. Create alerting systems that automate incident reporting and assignment, rather than relying on colleagues to flag these errors.
Finally, companies should examine patterns of error occurrence (for example, check if there is a surge in incidents after a big deployment), and allocate sufficient resources accordingly.
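The deployment check mentioned above could be sketched roughly as follows: count how many incidents started within a fixed window after each deployment. The 24-hour window, the function name and the sample timestamps are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def incidents_after_deploys(deploys, incidents, window=timedelta(hours=24)):
    """Count incidents that started within `window` after each deployment (hypothetical helper)."""
    counts = {}
    for deploy in deploys:
        # An incident is attributed to a deploy if it starts in [deploy, deploy + window).
        counts[deploy] = sum(1 for started in incidents if deploy <= started < deploy + window)
    return counts


# Made-up sample data: two deployments and four incident start times.
deploys = [datetime(2019, 6, 3, 10, 0), datetime(2019, 6, 10, 10, 0)]
incidents = [
    datetime(2019, 6, 3, 11, 30),
    datetime(2019, 6, 3, 18, 45),
    datetime(2019, 6, 4, 2, 15),
    datetime(2019, 6, 12, 9, 0),
]
surge = incidents_after_deploys(deploys, incidents)
for deploy, count in surge.items():
    print(deploy.isoformat(), count)
```

A deployment that is consistently followed by a cluster of incidents is a signal to add on-call capacity around releases or to tighten pre-release testing.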
Incident Resolution causes stress
Post-incident stress (up to 2 days after an incident) is commonplace among SREs. Those who are the lone SRE in their organization suffer from stress after almost every incident. Most SREs experience a change in their mood, their ability to concentrate or sleep, and even their appetite after an incident. It is clear that the process of incident resolution needs to be more user-friendly.
Automation and good use of alerting systems can reduce stress. Of the respondents who always experience post-incident stress, 20% discover incidents from users contacting the help desk, compared to 2-3% of those who never or only sometimes experience post-incident stress.
Alert Fatigue may be a cause of stress even if you have alerting tools in place. Check for false alerts, suppress unactionable warnings, and make sure that you are monitoring all crucial metrics. Zenduty's Alert Rules can help you set rules for notifications from each application.
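As a rough illustration of suppressing unactionable warnings, the sketch below drops alerts that are not actionable or fall below a severity threshold, and collapses duplicates for the same service and check. The `Alert` fields and the `filter_alerts` function are hypothetical; this is not Zenduty's Alert Rules API or any monitoring tool's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str     # "info", "warning", or "critical"
    actionable: bool  # whether a responder can do something about it


def filter_alerts(alerts, min_severity="warning"):
    """Drop unactionable or low-severity alerts and collapse duplicates (hypothetical helper)."""
    order = {"info": 0, "warning": 1, "critical": 2}
    seen = set()
    kept = []
    for alert in alerts:
        if not alert.actionable or order[alert.severity] < order[min_severity]:
            continue  # suppress noise that nobody can or needs to act on
        key = (alert.service, alert.check)
        if key in seen:
            continue  # a page for this service/check has already been kept
        seen.add(key)
        kept.append(alert)
    return kept


# Made-up sample alerts.
alerts = [
    Alert("checkout", "latency", "critical", True),
    Alert("checkout", "latency", "critical", True),  # duplicate
    Alert("search", "disk", "info", True),           # below the severity threshold
    Alert("billing", "cpu", "warning", False),       # unactionable
    Alert("billing", "errors", "warning", True),
]
kept = filter_alerts(alerts)
print([(a.service, a.check) for a in kept])
```

Rules like these only help if they are reviewed regularly, so that a suppressed check is never the one crucial metric nobody is watching.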
SREs need support from their teams and companies to alleviate stress
Around 10% of the respondents reported that their company does "nothing" to alleviate post-incident stress. Creating a blameless culture and giving incident responders some time off after a major outage can help them feel better.
However, it is even more important to prevent stress from occurring in the first place. Best practices for testing before and during deployment can prevent some incidents from occurring.
Regular post-incident reviews can also help alleviate stress, and they can help identify whether new resources need to be added to the SRE team. Knowledge sharing among team members may also help them respond better to incidents and feel less stressed overall.
Zenduty is a cutting-edge incident management platform designed by developers with the well-being of engineers in mind. Sign up for free here.