Using context to triage change-triggered incidents
One of the first things incident managers do when they get an alert page from Zenduty is check the “Context” tab of the incident. Incident context is critical for giving a first responder a view of what happened and what could have caused it. Context tells you what happened before an incident. In 40–50% of all incidents, Zenduty’s incident context can tell you within 5–10 seconds what the cause might be.
What causes incidents?
A few days back, I read a message from a thread in the Hangops Slack and I couldn’t help but chuckle.
Nowadays whenever I see “the root cause was […]” my brain automatically replaces with “our false confidence convinced us that […]”
Greg Sanker (@gtsanker) of Oregon State EIS ran a small Twitter survey a few years ago.
Some of the best insights around the causes of incidents can be found in Subbu Allamaraju’s blog. In his study, he identified five key reasons behind incidents:
The only thing about "SLOs" that actually matters is "are you thinking about your users?" The rest of it is just math that helps you do that more efficiently when you have 100 other things going on. That's, like, the whole thing. Are your users happy and have they been over time?
- Alex Hidalgo (THE BOOK IS DONE) (@ahidalgosre), June 11, 2020
What changed before an incident?
As you can see in the pie chart above, the single biggest contributing factor to incidents is change. After inspecting the incident alert, the first thing every incident responder looks for is context **around what changed before the incident was triggered.** This was the feedback we got from our pilot customers, which is why we decided to bake context into Zenduty from the get-go.
Plug :) If you’re a NOC, SRE, or ITOps team looking for an incident alerting, on-call management, and incident response service, sign up today on Zenduty for a 14-day trial.
Incident Context in Zenduty
Zenduty allows you to add “Context” integrations: alert sources that send contextual information but do not create any incidents in your service. These can be your Git records (GitHub, Bitbucket), CI/CD (Jenkins, Travis, CircleCI), upstream alert sources from services owned by other teams, config change events within your cloud (AWS, Azure, or GCP), or any other tool that gives your team information about your service. Here’s how it works.
Set up your service on Zenduty (see the guide) and add context integrations to your service. Whenever an incident is triggered within your service, you will be able to see all the contextual alerts (info) from your sources.
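To make the distinction concrete, here is a minimal sketch of the idea. It is not Zenduty's implementation or API: the `Service` class, its `receive` method, and the `is_context` flag are hypothetical names used only for illustration. Events from context integrations are simply recorded, while an event from a regular alert source opens an incident that carries the context collected so far.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    source: str        # illustrative source names, e.g. "github", "heroku"
    message: str
    is_context: bool   # hypothetical flag: True for context integrations


@dataclass
class Incident:
    title: str
    context: list[Event] = field(default_factory=list)


@dataclass
class Service:
    context_log: list[Event] = field(default_factory=list)
    incidents: list[Incident] = field(default_factory=list)

    def receive(self, event: Event) -> None:
        if event.is_context:
            # Context events are only recorded; they never open an incident.
            self.context_log.append(event)
        else:
            # Alert events open an incident with a snapshot of the context so far.
            self.incidents.append(Incident(event.message, list(self.context_log)))


service = Service()
service.receive(Event("github", "config change pushed to master", is_context=True))
service.receive(Event("heroku", "release deployed", is_context=True))
service.receive(Event("monitoring", "database connection errors", is_context=False))
```

In this sketch the two context events produce no incidents on their own, and the single alert event produces one incident with both of them attached.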
In the screenshot, an incident has been triggered by an incorrect database configuration change. The context for this incident comes from GitHub and Heroku (Cloud/CI/CD), which show that a config change was pushed and deployed to master and that the incident followed 15 minutes after the deployment. This context lets incident responders dig into possible root causes among recent changes and loop the people who made the change into the incident response process. Context is especially useful when it is brought into the comms channels where all your teams already reside.
Incident Context in Zenduty, if configured well, can help you trace 10–20% of all your incidents back to a change. Context can also greatly reduce your resolution time by ensuring that all available change information is at your fingertips the moment you start triaging your Sev0 incident.
If you’re a NOC, SRE, or ITOps team looking for a solid incident alerting, on-call, and incident response platform, do give Zenduty’s 14-day trial a spin.
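The triage step described here, looking at what changed shortly before the incident fired, can be sketched as a simple time-window filter. This is an illustration under stated assumptions rather than how Zenduty works internally: the `recent_changes` helper, the 30-minute lookback, the dictionary fields, and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_time, lookback=timedelta(minutes=30)):
    """Return changes made within `lookback` before the incident, newest first."""
    window_start = incident_time - lookback
    in_window = [c for c in changes if window_start <= c["time"] <= incident_time]
    return sorted(in_window, key=lambda c: c["time"], reverse=True)


# Made-up timeline: the incident fires 15 minutes after the deployment.
incident_time = datetime(2020, 6, 15, 12, 15)
changes = [
    {"source": "github", "summary": "database config change pushed",
     "time": datetime(2020, 6, 15, 11, 58)},
    {"source": "heroku", "summary": "release deployed",
     "time": datetime(2020, 6, 15, 12, 0)},
    {"source": "github", "summary": "README update",
     "time": datetime(2020, 6, 14, 9, 30)},
]

suspects = recent_changes(changes, incident_time)
for change in suspects:
    print(change["time"], change["source"], change["summary"])
```

With this data, only the deployment and the config push fall inside the window, so the previous day's unrelated commit is filtered out. The lookback window is a tuning choice: too short misses slow-burning changes, too long buries the likely cause in noise.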