Many alerts place an unnecessary burden on Ops teams instead of helping them solve issues. This presentation describes the phenomenon and four ways to address it.
Non-actionable alerts:
- point to issues that don't require a response
- lack critical information, forcing you to spend time searching for more insights in order to gauge their urgency

An excess of non-actionable alerts creates "alert fatigue", wasting time and resources and interfering with the real issues at hand.
Do you receive redundant alerts and:
- immediately ignore them?
- realize they aren't relevant to you?
- perform the same routine actions to obtain the information you actually need?
1. Unhelpful titles

The problem:
The title is one of the most important parts of an alert: it is the first thing you see.
Cryptic titles force responders to dig unnecessarily through the body of the alert for more information.
Frustration grows when different alerts share similar titles, causing confusion and wasting time.
1. Unhelpful titles

Example:
You receive an alert titled "CPU LOAD 1.80", followed by another titled "CPU LOAD 1.90".
Are these alerts even referring to the same server? Is a 1.80 load critical? What is affected by this problem?
Wouldn't it be great if the alert provided answers rather than raising more questions?
1. Unhelpful titles

Making it actionable:
All alerts should have short yet descriptive titles.
A title should enable the responder to know, at a glance, what the problem is, where it is, and how to address it.
For example: "Server billing-1 load is critical for 5 min" is much more actionable than "CPU LOAD 1.80".
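Such a title can be assembled mechanically from the alert's fields. A minimal sketch in Python, assuming a hypothetical `alert_title` helper and the fields shown:

```python
from datetime import timedelta

def alert_title(host: str, metric: str, severity: str, duration: timedelta) -> str:
    # Encode what is wrong (metric + severity), where (host), and for how long.
    minutes = int(duration.total_seconds() // 60)
    return f"Server {host} {metric} is {severity} for {minutes} min"

print(alert_title("billing-1", "load", "critical", timedelta(minutes=5)))
# -> Server billing-1 load is critical for 5 min
```

Because every alert flows through one formatter, titles stay consistent and two alerts about different servers can never collide on wording.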
2. Lack of vital information

The problem:
Alert content is often limited or cryptic, forcing us to spend many cycles understanding what the alert means and searching for more information in order to gain insight.
Somewhere within my Nagios, Graphite, Pingdom, or New Relic, the relevant information can be found, but instead of solving the issue, a significant portion of my valuable time is spent on such searches.
2. Lack of vital information

Example:
When addressing an alert about a server overload, the same set of tasks is almost always performed: connecting to the server to check the current load, or analyzing trends in the CPU graph.
And the next time a similar alert fires, you'll perform those same steps all over again.
2. Lack of vital information

Making it actionable:
- Identify alerts that require repetitive and predictable searches for more information.
- Automatically bundle that information as part of the alert.
- List the actions that need to be performed, or link to relevant resources such as scripts, protocols, or the developer's insight into why this might happen.
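The bundling step above can be a small enrichment stage in the alerting pipeline. This is a sketch under assumptions: `fetch_recent_load` is a placeholder for whatever metrics API you use (Graphite, etc.), and the runbook URL scheme is invented:

```python
def fetch_recent_load(host: str) -> list:
    # Placeholder: in practice, query your metrics store for this host.
    return [1.2, 1.5, 1.8]

def enrich_alert(alert: dict) -> dict:
    # Attach the context a responder would otherwise look up by hand.
    enriched = dict(alert)
    enriched["recent_load"] = fetch_recent_load(alert["host"])
    enriched["runbook"] = f"https://wiki.example.com/runbooks/{alert['check']}"
    enriched["suggested_actions"] = [
        "SSH to the host and inspect the top processes",
        "Check the CPU graph for a trend vs. a one-off spike",
    ]
    return enriched

alert = {"host": "billing-1", "check": "high_load"}
print(enrich_alert(alert)["runbook"])
# -> https://wiki.example.com/runbooks/high_load
```

The responder now opens one message containing the data, the runbook, and the first steps, instead of opening four dashboards.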
3. Alerts that don't require resolution

The problem:
Production environments are complex and dynamic.
To maintain reliability, vital system information must be accessible to Ops and Developers.
Our instinct tells us this can only be accomplished by being notified of every alert and exception.
In reality, however, the large majority of these alerts don't require any action, and they end up drowning out the ones that do.
3. Alerts that don't require resolution

Example:
An alert might be sent to indicate that a user entered an invalid credit card number.
While this information may be interesting, we have no control over the user's actions and can therefore do nothing about it.
This alert only adds noise.
3. Alerts that don't require resolution

Making it actionable:
If the alert doesn't lead to an immediate action on your part, don't send it.
Instead, find the issues that do require your attention.
For example, replace the invalid credit card alert with an actionable alert stating that the rate of checkouts has dropped dramatically: perhaps a change was made and a rollback is required.
Another option is a daily or weekly report that aggregates and visualizes the information that isn't needed in real time.
This way, the desired information is available at the right time.
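The checkout example can be sketched as a simple rate check. The numbers and the 50% threshold are illustrative; the point is to alert on the aggregate symptom rather than on each individual user event:

```python
def checkout_rate_dropped(recent_per_min, baseline_per_min, threshold=0.5):
    # Alert only when current throughput falls well below the baseline,
    # not on every failed checkout attempt.
    current = sum(recent_per_min) / len(recent_per_min)
    return current < threshold * baseline_per_min

print(checkout_rate_dropped([48, 52, 50], baseline_per_min=50))  # normal traffic: False
print(checkout_rate_dropped([10, 8, 12], baseline_per_min=50))   # collapse: True
```

A real system would compute the baseline from historical data and smooth the recent window, but the shape is the same: one actionable signal instead of a stream of unactionable ones.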
4. Alert routing

The problem:
In many organizations, everyone receives all the alerts.
This practice usually starts when teams are small and everyone is involved in everything.
As teams scale and people begin to specialize, however, the "loudspeaker" approach to alerting quickly becomes a drag.
4. Alert routing

Example:
Sending alerts about connection issues with your 3rd-party billing provider to your DBA team won't help resolve them; they will probably just be ignored.
4. Alert routing

Making it actionable:
Send alerts only to the people who are relevant to that alert.
This is easier said than done, as many alerts can be caused by several different sources.
In such cases, creating more specific alerts for each source provides the granularity needed to make better routing decisions.
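Once alerts carry a specific source, routing can be as simple as a lookup table. A sketch with invented source and team names:

```python
# Map each alert source to the team that owns it (names are illustrative).
ROUTES = {
    "billing-gateway": "payments-team",
    "postgres": "dba-team",
    "nginx": "web-team",
}

def route_alert(source: str) -> str:
    # Unknown sources go to an on-call catch-all, not to everyone.
    return ROUTES.get(source, "oncall-triage")

print(route_alert("billing-gateway"))   # -> payments-team
print(route_alert("mystery-service"))   # -> oncall-triage
```

Dedicated alerting tools offer richer routing trees, but even this table ends the loudspeaker problem: the DBA team never sees the billing provider's connection errors.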
Conclusion

Making alerts more actionable can significantly ease your pain and improve day-to-day work.
Simple changes can have a dramatic impact.
Conclusion

Actionable alerts can become irrelevant very quickly.
Foster a culture of ongoing improvement to your alerts: make a habit of periodically reviewing them and removing the non-actionable ones.