What is an incident response playbook?

An incident response playbook is a set of actions that your incident responders need to execute, depending on the nature of the outage. Well-defined incident response playbooks are critical, especially during high customer impact events that you would typically classify as Sev-0 incidents.

Playbooks (and runbooks) are generally created over a period of time by studying incident and alert patterns on Azure Monitor, both for incidents that occur frequently and for those that occur once in a while. As the speed of deployment accelerates, most teams have started to mandate that an incident response playbook be created during the incident postmortem phase itself. Most of these playbooks reside as text documents within the company’s cloud storage, knowledge management systems, collaboration service or git repositories.

The challenges

Search for the right playbook during an incident

While building an incident playbook is step one, the challenge comes in step two: making it accessible and searchable during a live incident. A major challenge that responders face is finding the right playbook for a specific incident. Searching for the right playbook can add a couple of minutes to your resolution time and consequently affect your SLAs.
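To make the search problem concrete, here is a minimal sketch of a tag-based playbook lookup. It is an illustration only, not how Zenduty or any particular tool works: the `Playbook` class, the tags, and the file paths are all hypothetical, and a real setup would likely use a proper search index.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    title: str
    tags: set[str] = field(default_factory=set)
    location: str = ""  # hypothetical path or link to the playbook document


def find_playbooks(playbooks, alert_text):
    """Rank playbooks by how many of their tags appear in the alert text."""
    words = set(alert_text.lower().split())
    scored = []
    for pb in playbooks:
        score = len(pb.tags & words)
        if score > 0:
            scored.append((score, pb))
    # Highest tag overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pb for _, pb in scored]


# Hypothetical playbook catalogue.
PLAYBOOKS = [
    Playbook("Database failover", {"database", "sql", "replication"}, "playbooks/db-failover.md"),
    Playbook("API latency", {"latency", "api", "timeout"}, "playbooks/api-latency.md"),
]

matches = find_playbooks(PLAYBOOKS, "High latency on SQL database primary")
for pb in matches:
    print(pb.title, "->", pb.location)
```

Even a lookup this simple only helps if playbooks are tagged consistently when they are written, which is one reason to create them during the postmortem while the alert wording is still fresh.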

Execution of playbook tasks

As an incident commander, you need to map out the incident response journey while you organize your incident responders and manage communications. Typically, during Sev-0 incidents, you have a large number of responders, subject matter experts and stakeholders in your comms channels (Microsoft Teams, Skype, Slack). It can get overwhelming to go through your playbook, delegate specific tasks (as outlined in the playbook) to specific people, follow up on those tasks and, most importantly, ensure that there is a clear separation of responsibilities and zero redundancy. The incident commander has to constantly keep an eye on the playbook, ensure that the responders are executing all the steps it outlines, confirm that the requisite authorizations are in place, and keep the stakeholders updated on the status of the incident.
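The bookkeeping described above can be sketched in a few lines. This is a hypothetical helper, not part of any product: it takes the playbook's steps and the delegations made so far, and reports steps nobody owns and steps that were handed to more than one person.

```python
from collections import Counter


def audit_assignments(playbook_steps, assignments):
    """Return (unassigned steps, steps assigned to more than one responder).

    playbook_steps: ordered list of step names from the playbook.
    assignments: list of (responder, step) pairs made during the incident.
    """
    counts = Counter(step for _, step in assignments)
    unassigned = [s for s in playbook_steps if counts[s] == 0]
    duplicated = [s for s in playbook_steps if counts[s] > 1]
    return unassigned, duplicated


# Hypothetical playbook steps and delegations.
steps = ["Check replication lag", "Fail over to replica", "Notify stakeholders"]
assignments = [
    ("asha", "Check replication lag"),
    ("ben", "Check replication lag"),
    ("cara", "Notify stakeholders"),
]

unassigned, duplicated = audit_assignments(steps, assignments)
print("Unassigned:", unassigned)
print("Assigned more than once:", duplicated)
```

Doing this by hand in a busy chat channel is exactly the kind of overhead that pulls the incident commander away from coordination.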

How NOC teams can attach playbooks directly to Azure alerts with Zenduty

First, a quick primer on Zenduty: Zenduty is an end-to-end incident alerting, on-call management, and response orchestration platform. Zenduty integrates with Azure Monitor and 100+ other services, escalates critical alerts to the right people, and helps you minimize your mean-time-to-repair. You can sign up for a free trial here and set up your on-call schedules, escalation policies and Azure Monitor integration here.

Many of our customers who build and deploy on Azure leverage Zenduty’s task templates to auto-attach their playbooks to their Azure incident alerts. Zenduty allows you to take your existing incident response playbook document and create a “task template” comprising discrete role-mapped tasks. Each of your playbooks can be converted into a task template on Zenduty. Once you create a task template, you can map it to specific services. As incidents start pouring in from Azure Monitor, Zenduty will automatically take the tasks from the task template and append them to your incident tasks. As the incident commander assigns roles to various responders, Zenduty will assign the role-mapped tasks to the respective responders.

Incident response with task templates

To set up playbooks, the service owner should create task templates for specific services. For example, the service owner can create a task template to deal with database issues.

The task template will comprise sequential tasks that need to be executed by the team during a database incident. These tasks can be role-mapped so that when the incident commander designates specific roles to specific people, Zenduty will auto-assign those tasks to those people depending on the roles assigned to them. You can write the tasks in markdown. Alternatively, if you prefer to host your playbook in your own knowledge repositories, you can add a link to those playbook documents.

When an incident is triggered, Zenduty will automatically take all the tasks from the associated task template and append them to the incident tasks. That way, when the on-call engineer sees the alert, they can navigate to the incident tasks section and start triaging and remediating the incident.
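The flow described in this section can be modelled in a short sketch. This is not Zenduty's API or data model: every class, function, service name and role below is hypothetical and exists only to show how a service-mapped template turns into role-assigned incident tasks.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TemplateTask:
    title: str
    role: Optional[str] = None  # role this task is mapped to, if any


@dataclass
class IncidentTask:
    title: str
    role: Optional[str]
    assignee: Optional[str] = None


@dataclass
class Incident:
    service: str
    tasks: list = field(default_factory=list)
    roles: dict = field(default_factory=dict)  # role name -> responder

    def assign_role(self, role, responder):
        """Record the role and hand its mapped tasks to the responder."""
        self.roles[role] = responder
        for task in self.tasks:
            if task.role == role:
                task.assignee = responder


# Hypothetical mapping of service name -> task template.
TEMPLATES = {
    "orders-db": [
        TemplateTask("Check replication lag", role="Database SME"),
        TemplateTask("Fail over to the replica", role="Database SME"),
        TemplateTask("Post a status update", role="Communications lead"),
    ]
}


def open_incident(service):
    """Create an incident and append the tasks from the service's template."""
    incident = Incident(service=service)
    for t in TEMPLATES.get(service, []):
        incident.tasks.append(IncidentTask(t.title, t.role))
    return incident


incident = open_incident("orders-db")
incident.assign_role("Database SME", "asha")
for task in incident.tasks:
    print(task.title, "->", task.assignee or "unassigned")
```

In this sketch the communications task stays unassigned until someone is given that role, which mirrors the idea that tasks follow roles rather than named individuals.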

How it all comes together

When Azure Monitor creates an alert, an incident is created on Zenduty. Zenduty alerts the service engineers (via SMS, Phone/Voice, Slack, Microsoft Teams, Android/iOS push notification) as defined by the escalation policies.

Task templates also make it insanely easy to track the progress of an incident and collaborate effectively during incident response within your team collaboration platforms like Slack and Microsoft Teams.

Conclusion

Task templates can dramatically increase your incident preparedness, decrease on-call anxiety, reduce RCA/response errors, and improve response times. Task templates also ensure that all the steps within your incident playbooks are executed, and they give your incident commanders a solid framework for delegating specific tasks to subject matter experts. Task templates can also help stakeholders (upper management, sales teams, support teams, and account managers) track the progress of the incident.