Site reliability engineers have one of the toughest roles, if not the toughest, in any organization. Dealing with incidents is one part of the job; the other is building reliable systems. Google’s SRE book sums up this approach nicely:

At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect. We share this 50% goal because toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time. The work of reducing toil and scaling up services is the “Engineering” in Site Reliability Engineering. Engineering work is what enables the SRE organization to scale up sublinearly with service size and to manage services more efficiently than either a pure Dev team or a pure Ops team.

One of the most important challenges an SRE faces in balancing firefighting against toil reduction is alert noise. For a particular engineer, an alert is noisy if it does not tell them about an event, past or imminent, that significantly affects the end user or customer, and if it is not actionable on their part. Alert noise not only distracts your engineers from their engineering duties, but also causes needless stress, increases anxiety, and exacerbates any mental and physical health issues your engineers may already be facing. It will come as no surprise that organizations with the noisiest alerts also tend to have the highest engineering churn rates in the industry.

Every monitoring or IT management tool sends a lot of alerts daily. While some alerts are important and actionable, most of them are noise. Nobody likes to wake up to an alert at 2 AM unless it’s a major incident or an outright catastrophe. We took a close look at incident management services and how they help teams reduce alert noise. We were surprised to find that all of them have implemented just bare-bones features to mitigate alert noise, and despite being in the market for 7–10 years, have failed to treat alert noise reduction as a problem that needs to be solved in a meaningful way.

How Zenduty can help you shrink a river of alerts to a trickle

When we started building Zenduty, solving the problem of alert noise was at the top of our list of priorities. And so we did, with Alert Rules. Following are the criteria on which you can route your alerts in Zenduty, using operators such as <, ≤, >, ≥, ==, !=, contains (substring), not contains, regex match, is empty, is not empty, in (multi-select), not in (multi-select), between, and not between:

  1. Payload Search — Every alert from a monitoring integration sends its own custom payload along with the alert parameters. For example, a critical alert from Jenkins will also contain a payload like {“job”: “build-prod”, “job-number”: 5, “status”: “SUCCESS”}. You can run a key search on the alert payload and apply the above operators to the value (see the sketch after this list).
  2. Incident Message and Incident Summary — Route if you find a keyword in the alert message or summary
  3. Alert Time of the day — Route depending on the time of the day
  4. Alert Day of the week — Route depending on the day of the week
  5. Alert Date — Route if the incident falls on or between specific dates
  6. Alert Type — Route depending on the type of the alert (critical, error, warning, acknowledged, resolved and info)
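
To make the matching concrete, here is a minimal sketch of how a set of routing conditions using these operators could be evaluated against an incoming alert. This is only an illustration under assumed names (OPERATORS, matches, the condition dictionaries); it is not Zenduty’s actual implementation or API.

```python
import re

# Hypothetical operator table: each operator maps to a predicate over
# (alert value, rule target). Names are illustrative, not Zenduty's API.
OPERATORS = {
    "==": lambda value, target: value == target,
    "!=": lambda value, target: value != target,
    ">": lambda value, target: value > target,
    ">=": lambda value, target: value >= target,
    "<": lambda value, target: value < target,
    "<=": lambda value, target: value <= target,
    "contains": lambda value, target: target in str(value),
    "not contains": lambda value, target: target not in str(value),
    "regex match": lambda value, target: re.search(target, str(value)) is not None,
    "is empty": lambda value, target: value in (None, "", [], {}),
    "is not empty": lambda value, target: value not in (None, "", [], {}),
    "in": lambda value, target: value in target,
    "not in": lambda value, target: value not in target,
    "between": lambda value, target: target[0] <= value <= target[1],
    "not between": lambda value, target: not (target[0] <= value <= target[1]),
}

def matches(alert: dict, conditions: list[dict]) -> bool:
    """Return True only if every routing condition holds for this alert."""
    for cond in conditions:
        value = alert.get(cond["field"])          # payload key, message, alert_type, ...
        predicate = OPERATORS[cond["op"]]
        if not predicate(value, cond.get("target")):
            return False
    return True

# Example: match only critical alerts from a prod Jenkins job whose build failed.
alert = {
    "alert_type": "critical",
    "message": "Jenkins build build-prod #5 failed",
    "job": "build-prod",
    "status": "FAILURE",
}
conditions = [
    {"field": "alert_type", "op": "==", "target": "critical"},
    {"field": "job", "op": "contains", "target": "prod"},
    {"field": "status", "op": "in", "target": ["FAILURE", "UNSTABLE"]},
]
print(matches(alert, conditions))                 # -> True
```

The idea is simply that each operator is a predicate over the alert’s field value and the rule’s target, and a rule fires only when every one of its conditions is satisfied.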

Following are the actions you can take if the alert routing conditions are met:

  1. Suppress — Suppress the incident
  2. Change Alert Type — Let’s say the alert type of the incoming alert is critical (which will create an incident) and you do not want it to create an incident; you can change the alert type to “info”.
  3. Add Note — Automatically add a note to the resultant incident that will help the on-call engineers and incident commander determine the RCA and triaging steps
  4. Route to Escalation Policy — Route to an escalation policy that is not the default on the service affected. This may be useful in cases where you might need a more aggressive or less aggressive escalation depending on the type of alert that has come in.
  5. Assign User — Assign the incident to a specific user. This is useful when a known component is affected and the fastest way to resolve the incident is to assign it to a known subject matter expert.
  6. Change Incident Urgency — Change the urgency of the incident. If the alert creates a low-priority incident but your conditions detect a potentially high-priority one, you can conditionally change the priority to high, and vice versa.
  7. Change Alert Message — Create your own custom Incident Title from the alert payload
  8. Change Alert Summary — Create your own custom Incident Summary from the alert payload
  9. Change Alert Entity ID — Change the alert entity_id to implement custom deduplication (see the sketch after this list)
  10. Assign Role to User — After the alert creates an incident, assign a particular role to a particular user
  11. Assign Incident Tag — Automatically add a tag to the newly created incident depending on the matching condition. For example, if a backend component is affected, add a tag “backend”.
  12. Add Incident Task and Template (coming soon) — Automatically add an incident task template or a task to the incident
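
To illustrate how these actions could play out, and in particular how changing the entity_id enables custom deduplication, here is a minimal hypothetical sketch (assumed names throughout, not Zenduty’s actual implementation or API). Alerts that end up with the same entity_id are rolled into the same open incident instead of paging your team repeatedly.

```python
# Hypothetical sketch of applying matched-rule actions, including custom
# deduplication via entity_id. Names are illustrative, not Zenduty's API.
open_incidents: dict[str, dict] = {}              # entity_id -> open incident

def apply_actions(alert: dict, actions: list[dict]):
    """Apply actions to a matched alert, then create or deduplicate an incident."""
    for action in actions:
        if action["type"] == "suppress":
            return None                           # drop the alert entirely
        if action["type"] == "change_alert_type":
            alert["alert_type"] = action["value"]  # e.g. critical -> info
        if action["type"] == "add_note":
            alert.setdefault("notes", []).append(action["value"])
        if action["type"] == "change_entity_id":
            # Build a custom dedup key from payload fields, e.g. "build-prod:FAILURE"
            alert["entity_id"] = action["template"].format_map(alert)

    if alert["alert_type"] not in ("critical", "error"):
        return None                               # info/warning alerts open no incident here

    # Fall back to the message as the dedup key if no custom entity_id was set.
    entity_id = alert.get("entity_id", alert["message"])
    if entity_id in open_incidents:
        open_incidents[entity_id]["alerts"].append(alert)   # deduplicated into same incident
    else:
        open_incidents[entity_id] = {"title": alert["message"], "alerts": [alert]}
    return open_incidents[entity_id]

# Two alerts from the same failing Jenkins job collapse into one incident.
actions = [{"type": "change_entity_id", "template": "{job}:{status}"}]
for build in (5, 6):
    apply_actions({"alert_type": "critical", "job": "build-prod", "status": "FAILURE",
                   "message": f"Jenkins build build-prod #{build} failed"}, actions)
print(len(open_incidents))                        # -> 1
```

The value of rewriting the entity_id from payload fields is that you, rather than the monitoring tool, decide what counts as “the same problem”.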

Zenduty’s Alert Rules let you fine-tune your incident alerting at many levels: make sure the right team members are alerted when specific conditions are met, add notes and tags, route to an escalation policy other than the service default, customize the incident title and summary, and, most importantly, suppress noisy incidents.

If you’re looking for an end-to-end incident management platform with awesome alert routing and response management capabilities, do give Zenduty a spin and leave us your feedback in the comments below.

And finally, be zen!