Vishwa Krishnakumar, Co-founder at Zenduty. Helping companies institutionalize reliability within their operations.

Defining your Sev-1s

One of the first things to figure out when your team is formulating its incident management process is how to describe, in words, what a Sev-1 (your highest incident priority) looks like. “Website doesn’t work” is certainly not enough. “Website is up but a key resource (e.g. a CSS file) is missing, rendering the website unusable” is still not enough. “A single page on the website is 404’ing” is not a major incident, but it could be a minor one.

Sev-1s are incidents where the level of impact is such that the company will go out of business if you don’t fix the issue soon. The first thing you need to do is define the company’s commitments. For example: every view of the website will complete in X ms, and it will return a status code in the 2XX range (or <500, or whatever you choose). Describe whatever commitments you want to make to the customers of that website, and then measure them. Whenever any one of those commitments is violated, count it. That count of violations then drives how severe an incident is. In other words: define SLOs, measure SLIs, and examine the ratio of requests that violate one or more SLOs out of all requests received.
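To make the ratio concrete, here is a minimal sketch in Python. The `Request` type, the 500 ms latency limit, and the "anything below 500 is fine" status rule are all assumptions picked for illustration; substitute whatever commitments you actually make to your customers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    status: int


# Assumed commitments (SLOs); these numbers are placeholders, not recommendations.
MAX_LATENCY_MS = 500.0   # every view completes within this many milliseconds
MAX_OK_STATUS = 499      # any status below 500 counts as acceptable


def violates_slo(req: Request) -> bool:
    """True if the request breaks at least one commitment."""
    too_slow = req.latency_ms > MAX_LATENCY_MS
    server_error = req.status > MAX_OK_STATUS
    return too_slow or server_error


def violation_ratio(requests: list[Request]) -> float:
    """Share of requests that violate one or more SLOs, out of all requests."""
    if not requests:
        return 0.0
    violations = sum(1 for r in requests if violates_slo(r))
    return violations / len(requests)


sample = [
    Request(latency_ms=120, status=200),  # fine
    Request(latency_ms=900, status=200),  # too slow
    Request(latency_ms=80, status=503),   # server error
    Request(latency_ms=200, status=404),  # fine under the "<500" rule
]
print(violation_ratio(sample))  # 0.5
```

A request that is both slow and failing still counts once here, which matches "violate one or more SLOs" rather than counting each broken commitment separately.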

Depending on the site, it might be necessary to slice things up further and measure all of those things by route, by service, or by whatever split makes sense. Fundamentally, though, explicitly defining the commitments first, with actual numbers, is the right approach. There can always be other key metrics beyond the basics. For example, if the site sells things, counting the value of failed transactions might be important too, so you can understand how much a problem is costing you in actual dollars. It really depends on thinking about the service being provided first and working back from there to the commitments you want to make to customers. But again, use actual numbers that measure the customer experience. With those in hand, you can take the most important metrics, aggregate the number of violations, and then formulate your severities/priorities for different ranges.
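As a rough sketch of that last step, the Python below groups observations by route, totals the dollar value of failed transactions, and maps each route's violation ratio onto a severity. The `Observation` type, the route names, and the threshold bands are all hypothetical; the right ranges depend entirely on your own service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One request outcome (hypothetical shape, for illustration only)."""
    route: str
    violated_slo: bool
    failed_order_value: float = 0.0  # dollars lost if a purchase failed


# Assumed bands: (minimum violation ratio, severity), checked highest first.
SEVERITY_BANDS = [(0.50, "Sev-1"), (0.10, "Sev-2"), (0.01, "Sev-3")]


def severity_for(ratio):
    """Return the first band whose minimum the ratio meets, or None."""
    for minimum, label in SEVERITY_BANDS:
        if ratio >= minimum:
            return label
    return None


def summarize(observations):
    """Per-route violation ratio, dollars lost, and resulting severity."""
    totals = defaultdict(lambda: {"requests": 0, "violations": 0, "lost": 0.0})
    for obs in observations:
        t = totals[obs.route]
        t["requests"] += 1
        if obs.violated_slo:
            t["violations"] += 1
            t["lost"] += obs.failed_order_value
    report = {}
    for route, t in totals.items():
        ratio = t["violations"] / t["requests"]
        report[route] = {
            "ratio": ratio,
            "lost_dollars": t["lost"],
            "severity": severity_for(ratio),
        }
    return report


observations = [
    Observation("/checkout", True, 40.0),
    Observation("/checkout", True, 60.0),
    Observation("/checkout", False),
    Observation("/checkout", False),
] + [Observation("/about", False)] * 10

for route, row in summarize(observations).items():
    print(route, row)
```

In this made-up data, half of the `/checkout` requests fail and cost $100 in lost orders, so that route lands in the top band, while `/about` has no violations and raises nothing. Keeping the bands in one small table makes them easy to review and adjust as a team.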

