Incident Alert Routing — Reducing Alert Noise
Our monitoring tools are the sentinels of our systems. They scan every aspect of system health and alert you whenever anything goes awry. For sysadmins, it is extremely important to design the alerting infrastructure in a way that maximizes its signal-to-noise ratio.
To borrow a quote from the great Captain Edmund Blackadder, finding the root cause of an incident is like looking for a tiny piece of hay within a stack full of needles. Alert noise from a badly configured system can hurt your company in more than one way. First, the noisier the system, the higher the probability of missing a critical alert. Second, noisy alerts can cause serious mental and physical health issues within your operations teams and, consequently, higher churn and loss of in-house expertise. It is therefore important that you constantly tweak your systems to weed out alerts that are not important or actionable. Here are three ways:
Grouping alerts by component
Most monitoring tools expose a range of metrics and can alert whenever one of them crosses a threshold. Instead of sending one alert per metric, identify a metric that points to the root cause rather than to its symptoms. A good approach is to group a component's metrics together and fire a single alert when that component is down.
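As a rough illustration of grouping by component, here is a minimal Python sketch. The metric names, the METRIC_COMPONENT mapping, and the group_breaches helper are all hypothetical and not part of any particular monitoring tool; the point is only that several breached metrics collapse into one alert per component.

```python
from collections import defaultdict

# Hypothetical mapping from metric name to the component it belongs to.
METRIC_COMPONENT = {
    "db_cpu": "database",
    "db_connections": "database",
    "db_replication_lag": "database",
    "web_latency": "web",
    "web_error_rate": "web",
}


def group_breaches(breached_metrics):
    """Collapse a list of breached metrics into one entry per component."""
    by_component = defaultdict(list)
    for metric in breached_metrics:
        # Metrics we have no mapping for land in an "unknown" bucket.
        component = METRIC_COMPONENT.get(metric, "unknown")
        by_component[component].append(metric)
    return {component: sorted(metrics) for component, metrics in by_component.items()}


# Three breached metrics become two alerts, one per affected component.
alerts = group_breaches(["db_cpu", "web_latency", "db_connections"])
for component, metrics in alerts.items():
    print(f"{component}: {', '.join(metrics)}")
```

In this sketch the on-call engineer would receive one "database" alert listing both breached metrics, instead of two separate pages.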
Analyze, analyze, analyze
Alert noise reduction is a continuous process. At the end of every month, review all of your alert data and segment it into actionable, important alerts and suppressible ones. Analyze correlations between different kinds of alerts, look for alerts that correlate with several others, and pick one indicative metric to alert on.
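One simple way to start such a review is to count which alerts tend to fire together. The sketch below assumes the month's alerts have been exported as (unix_timestamp, alert_name) pairs; the co_occurrence function, the sample alert names, and the five-minute window are illustrative assumptions. It uses fixed time buckets, so two alerts that straddle a bucket boundary are not counted together, which is a deliberate simplification.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(alerts, window_seconds=300):
    """Count how often each pair of alert names fires in the same time bucket."""
    buckets = {}
    for timestamp, name in alerts:
        # Alerts in the same fixed-size bucket are treated as co-occurring.
        buckets.setdefault(timestamp // window_seconds, set()).add(name)

    pairs = Counter()
    for names in buckets.values():
        for pair in combinations(sorted(names), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical alert history: (unix timestamp, alert name).
history = [
    (0, "disk_full"),
    (30, "db_write_errors"),
    (60, "api_5xx"),
    (900, "disk_full"),
    (930, "db_write_errors"),
    (4000, "cpu_high"),
]

pairs = co_occurrence(history)
for pair, count in pairs.most_common():
    print(pair, count)
```

Pairs with high counts are candidates for consolidation: if disk_full almost always accompanies db_write_errors, alerting on one indicative metric may be enough.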
If your monitoring tools have logical operators that can act on variables in real time, now is the time to put them to work. Instead of sending an alert every time a metric crosses its 90% threshold, send an alert if the metric was above that threshold for 50% of the time in the last 15 minutes. If your system generates multiple alerts for the same root cause, combine multiple variables for threshold breaches within a set time window (depending on your SLAs): if metrics x, y, and z have each been over their thresholds for 50% of the time in the last 15 minutes, raise an alert.
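The time-window rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any tool's actual rule engine: it assumes one sample per minute, so a 15-minute window is a list of 15 values, and the breach_ratio and should_alert helpers and the metric names are hypothetical.

```python
def breach_ratio(samples, threshold):
    """Fraction of samples in the window that are above the threshold."""
    if not samples:
        return 0.0
    return sum(1 for value in samples if value > threshold) / len(samples)


def should_alert(windows, thresholds, min_ratio=0.5):
    """Alert only if every metric breached its threshold for at least
    min_ratio of the window. `windows` maps metric name to its samples."""
    return all(
        breach_ratio(windows[metric], thresholds[metric]) >= min_ratio
        for metric in thresholds
    )


# One sample per minute over the last 15 minutes (assumed sampling rate).
spike = {"cpu": [50] * 14 + [95]}        # a single one-minute spike
sustained = {"cpu": [95] * 8 + [50] * 7}  # above 90% for 8 of 15 minutes

print(should_alert(spike, {"cpu": 90}))      # False
print(should_alert(sustained, {"cpu": 90}))  # True

# Multi-metric rule: x, y and z must all be in sustained breach.
combined = {
    "cpu": [95] * 10 + [50] * 5,
    "memory": [92] * 9 + [60] * 6,
    "disk": [50] * 15,
}
limits = {"cpu": 90, "memory": 90, "disk": 90}
print(should_alert(combined, limits))        # False, disk never breached
```

The window length and the 50% ratio are the knobs to tune against your SLAs: a shorter window or lower ratio pages sooner but lets more transient spikes through.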
We recently shipped Zenduty Alert Rules, which let you fine-tune your incident alerting at many levels. With them you can make sure that the right team members are alerted when specific conditions are met, add notes and tags, route to an escalation policy other than the service's default, customize the incident title and summary, and, most importantly, suppress the incident.
If you’re looking for an end-to-end incident management platform with awesome alert routing and response management capabilities, do give Zenduty a spin and leave us your feedback in the comments below.
And finally, be zen!