Incident Response 2.0: The Zenduty Incident Command System (ICS)
We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). It was many months in the making and went through multiple iterations, and we believe it will redefine proactive incident management and response. We thought of the Slack-ICS the moment Slack released its game-changing Block Kit UI framework, which let us upgrade our Slack integration from somewhat of a Bruce Banner to the Incredible Hulk. A special shoutout to the DevRel team at Slack, especially Emily Pinkerton, Gregg Rybicki and Aman Gupta, for helping us navigate the Block Kit framework and select the right combination of UI ingredients.
The Slack-ICS was the next logical implementation in our Zenduty-ICS solution set after Incident Roles, Incident Tasks, Task Templates, and Incident Tags. The goal of the Zenduty-ICS system is to implement a recursive separation of responsibilities and a recognized chain of command, to manage and assign incident tasks by role or on an ad-hoc basis, and to establish a clear path to resolution after an incident is acknowledged.
The before (pre-ICS) experience
Before our Zenduty-ICS framework shipped in early November, our Slack user experience was, in my opinion, minimal, but pretty much on par with our competitors. When an incident was triggered, alerts were sent by phone, SMS, email, and Slack, and a message was posted to a dedicated Slack channel. Actions on this Slack message were restricted to acknowledging, resolving and reassigning, with alerts being sent for any change to the state of the incident.
There were a few problems, especially with critical incidents involving multiple responders/stakeholders:
If there’s no proper team-wide incident response training whereby someone declares themselves as commander, nobody will know who is leading the incident. (Ack’er != Commander).
There were no standard operating procedures or checklists in place. For teams that have been through training, users might declare roles in the channel and scurry over to an incident response document or a runbook.
There was no way for a commander (if designated) to delegate tasks. There were no mechanisms to assign concrete roles, tasks were assigned ad hoc, follow-throughs were forgotten, and nobody knew who was working on what unless they read through a bunch of message threads.
Incident Management 2.0 — The Zenduty-ICS experience
Last month, we introduced the four central pieces of the Zenduty incident response framework: Tasks, Roles, Tags, and Task Templates.
These four pieces combined give anyone a high-level overview of the progress of any incident. Whenever an incident is triggered in Zenduty, the on-call engineer, after receiving the page, may acknowledge the incident and then either claim the incident commander role or ask someone else to take command. The incident commander will then add relevant responders, invite users to take up specific roles, assign tasks to those roles and update the incident status on the go.
Depending on your org structure and business, you can define roles like:
Incident Commander: leads the incident, and assigns roles and tasks to different people
Comms Lead: communicates about the incident to customers and stakeholders
Ops Lead: makes changes and pushes the fix
The Task Template
The task template is the starting point for any incident. In the task template, you can define standard operating procedures and assign them to specific roles. When a particular service encounters an incident, the tasks from its task template will automatically be added to the incident task list, and after the incident commander assigns roles to specific individuals, these individuals will automatically inherit the tasks for their roles. At any given point, anybody in the team or company will know exactly who is leading the incident and who is dealing with specific aspects of the incident.
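To make the template-to-role-to-person flow concrete, here is a minimal sketch in Python. It is not Zenduty's implementation or API: the `Task` and `Incident` classes, their method names, the template contents and the user name are all hypothetical, chosen only to illustrate how tasks addressed to a role could be inherited by whoever takes that role.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    title: str
    role: str                       # role the task is addressed to
    assignee: Optional[str] = None  # filled in once the role has an owner


@dataclass
class Incident:
    """Hypothetical model of an incident; not the Zenduty data model."""
    title: str
    tasks: list[Task] = field(default_factory=list)
    roles: dict[str, str] = field(default_factory=dict)  # role -> user

    def apply_template(self, template: list[tuple[str, str]]) -> None:
        """Copy every (title, role) pair from the template into the task list."""
        for title, role in template:
            self.tasks.append(Task(title, role, self.roles.get(role)))

    def assign_role(self, role: str, user: str) -> None:
        """Give a role to a user; the user inherits that role's tasks."""
        self.roles[role] = user
        for task in self.tasks:
            if task.role == role:
                task.assignee = user

    def tasks_for(self, user: str) -> list[str]:
        """Titles of all tasks currently assigned to the user."""
        return [t.title for t in self.tasks if t.assignee == user]


# Made-up template for an imaginary payments service.
PAYMENTS_TEMPLATE = [
    ("Open a status page entry", "Comms Lead"),
    ("Roll back the last deploy", "Ops Lead"),
]

incident = Incident("Checkout errors")
incident.apply_template(PAYMENTS_TEMPLATE)
incident.assign_role("Ops Lead", "priya")
print(incident.tasks_for("priya"))  # ['Roll back the last deploy']
```

The point of the design is that tasks are addressed to roles rather than people, so the checklist can be written long before anyone knows who will be on call.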
Incident tags let you classify the incident based on component, service, priority, customers or any other segment. These tags can then be used in your analytics dashboard to measure tag-wise service, team, and user performance.
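As a rough illustration of what a tag-wise metric could look like, the sketch below groups incidents by tag and averages their time to resolve. The record layout (`tags`, `minutes_to_resolve`), the function name and the sample data are assumptions made for this example; they are not the Zenduty analytics API or real measurements.

```python
from collections import defaultdict
from statistics import mean


def mean_minutes_to_resolve_by_tag(incidents):
    """Average resolution time per tag; an incident counts once per tag it carries."""
    durations = defaultdict(list)
    for incident in incidents:
        for tag in incident["tags"]:
            durations[tag].append(incident["minutes_to_resolve"])
    return {tag: mean(values) for tag, values in durations.items()}


# Made-up sample data, for illustration only.
sample = [
    {"tags": ["payments", "p1"], "minutes_to_resolve": 30},
    {"tags": ["payments"], "minutes_to_resolve": 50},
    {"tags": ["search", "p1"], "minutes_to_resolve": 10},
]

for tag, minutes in sorted(mean_minutes_to_resolve_by_tag(sample).items()):
    print(tag, minutes)
```

The same grouping works for any per-incident number, such as time to acknowledge, as long as each incident carries its tags.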
Anatomy of an incident with ICS
There are two ways in which an incident can be created on Zenduty: automatically through a service integration/API, or manually by a user via the Zenduty dashboard or a Slack slash command. Once an incident is triggered, Zenduty will automatically run the escalation policies and Email/SMS/Call/Slack the folks on-call. In the meantime, Zenduty will create a dedicated Slack channel for the incident, a Jira ticket, a Zoom/Webex/Hangout room and a Statuspage incident, the links to which are added to the incident’s Comms section. Zenduty will then automatically take all the tasks from the task template associated with the service and add them to the incident tasks. If any alert rules and actions are configured, Zenduty can also automatically assign a user or add tags.
Once the Slack channel is created, Zenduty will automatically add the on-call engineer. The on-call engineer can acknowledge the alert, check the service context alerts for some preliminary RCA, and start roping in additional engineers to assist with the incident by adding them as responders (Zenduty will notify them via SMS/Call/Email/Slack). Zenduty automatically adds these responders to the incident channel. At this point, the on-call engineer can make themselves the incident commander or assign someone else, depending on the situation. The incident commander then proceeds to assign different roles to people.
Once there are sufficient SMEs in the channel, each person with a role will start working on their assigned tasks. The incident commander may choose to assign additional tasks to a role or an ad-hoc member. Zenduty will notify the assignee whenever a task is assigned to them.
Stakeholders can follow the progress of major incidents in their own dedicated channels, and Zenduty will auto-update the incident message as new incident updates keep pouring in.
Once an incident is completely resolved, Zenduty will automatically archive the channel and close any associated Jira issues or Statuspage incidents.
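The lifecycle described in this section can be pictured as a small state machine. The sketch below is only an illustration: the status names, the transition table (including the assumption that an incident can be resolved without being acknowledged first) and the `IncidentLifecycle` class are hypothetical, and the cleanup steps are plain strings standing in for the real Slack, Jira and Statuspage actions.

```python
from enum import Enum


class Status(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Assumed transition table: resolving is allowed with or without a prior
# acknowledgment, and a resolved incident is final.
ALLOWED = {
    Status.TRIGGERED: {Status.ACKNOWLEDGED, Status.RESOLVED},
    Status.ACKNOWLEDGED: {Status.RESOLVED},
    Status.RESOLVED: set(),
}

# Cleanup steps named in the text, represented here as plain strings.
CLEANUP_ON_RESOLVE = [
    "archive Slack channel",
    "close Jira issue",
    "close Statuspage incident",
]


class IncidentLifecycle:
    """Hypothetical lifecycle tracker; not part of any Zenduty SDK."""

    def __init__(self) -> None:
        self.status = Status.TRIGGERED
        self.actions = []  # cleanup steps queued so far

    def transition(self, new_status: Status) -> None:
        """Move to new_status, rejecting moves the table does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status
        if new_status is Status.RESOLVED:
            self.actions.extend(CLEANUP_ON_RESOLVE)


incident = IncidentLifecycle()
incident.transition(Status.ACKNOWLEDGED)
incident.transition(Status.RESOLVED)
print(incident.actions)
```

Tying the cleanup to the transition itself, rather than to whoever clicks resolve, is what keeps the channel, ticket and status page from being left open after the incident ends.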
Best Practices for Incident Management
The following are the best practices prescribed by Google in the Google SRE book (highly recommended).
Prioritize: Stop the bleeding, restore service, and preserve the evidence for root-causing.
Prepare: Develop and document your incident management procedures in advance, in consultation with incident participants.
Trust: Give full autonomy within the assigned role to all incident participants.
Introspect: Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
Consider alternatives: Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.
Practice: Use the process routinely so it becomes second nature.
Change it around: Were you the incident commander the last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.
Here’s a video showing the entire incident lifecycle with Zenduty ICS in action.