SRE: Alerting on SLOs

Terminology Refresher

An SLO is a quantifiable target representing a client’s experience. SLOs are built on SLIs, and SLIs are ratios: the numerator counts “successful” events and the denominator counts total events. There are two common types of SLIs:

  • Availability — the ratio of successful (non-5xx) requests to total requests:

sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

  • Latency — the ratio of requests served under a threshold (here, 100ms) to total requests:

sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

Client Experience

Good SLIs represent a client’s experience, which means SLO alerts monitor client experience in real time. When an SLO alert triggers, there is a high level of confidence that clients are being affected. SLO alerts are a subset of symptom-based alerts, which focus on the behavior of systems rather than their causes. Alerting on high resource usage or low disk space are two common examples of cause-based alerts. When a cause-based alert triggers, there is little context into the effects; it provides nothing to help answer the question: are clients affected? The first step in responding to a cause-based alert is investigation, not remediation: the responder has to research and understand the effect of the resource problem. The first step in responding to an SLO alert is remediation. SLO alerts remove ambiguity from the incident response process.
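
As a rough sketch of the difference, here is what the two styles might look like as Prometheus alerting rules. The disk-space rule uses standard node_exporter metrics; the thresholds, windows, and rule names are illustrative rather than prescriptive.

groups:
  - name: example-alerts
    rules:
      # Cause-based: fires on a resource condition and says nothing about clients.
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 10m
        labels:
          severity: warning

      # Symptom-based (SLO): fires when the availability SLI drops below target,
      # so a trigger means clients are already seeing failed requests.
      - alert: AvailabilitySLOViolation
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) < 0.99
        for: 5m
        labels:
          severity: critical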

Objective Quantities

SLO alerts quantify impact to clients: when an SLO alert fires, the responder knows that clients are impacted. Not only do SLO alerts indicate that clients are affected, they also indicate how many requests are affected. Consider an operation with a 99% availability SLO. If an alert triggers and the current performance is 98%, it’s known that 2% of requests (2 in every 100) are failing. Alerting on objective measurements quantifies the impact of incidents: a 50% failure rate may warrant a different response than a 0.1% error rate.
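
For instance, the failure ratio behind that number can be read directly off the SLI. A minimal sketch using the availability SLI from above (the 1h window is illustrative):

1 - (
  sum(rate(http_requests_total{status!~"5.."}[1h]))
    /
  sum(rate(http_requests_total[1h]))
)

If this evaluates to 0.02 against a 99% target, 2% of requests are failing, and the response can be sized accordingly.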

Call to Action

When an SLO alert triggers, a responder knows that clients are being affected and the degree to which they are affected, which makes SLO alerts highly actionable. Since an SLO represents a committed level of service, an alert means that level of service is not being met. The alert triggers a call to action: the error budget is being spent, a certain level of service was committed to, and the situation needs to be addressed immediately, without ambiguity or preliminary investigation.
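
One way to make the error budget concrete is to track how much of it remains over the SLO window. A sketch, assuming a 99% availability target measured over 30 days (1 means the budget is untouched, 0 means it is exhausted, negative means it is overspent):

1 - (
  (
    1 - sum(rate(http_requests_total{status!~"5.."}[30d]))
          /
        sum(rate(http_requests_total[30d]))
  )
  /
  (1 - 0.99)
)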

Generic Tooling

SLO alerts benefit from generic tooling. Since SLOs are ratios and generate values between 0 and 1 (0% and 100%), it’s easy to build tooling around them. A user can specify a numerator metric, a denominator metric, and a target level of service, and alerts can then be generated automatically based on one of Google’s SLO alerting strategies. Tooling removes moving parts and one-off alerting solutions: instead of one team using threshold-based alerts and another using a different strategy, the alert logic is codified in the tooling. Google also outlines a way to bucket alert levels, which provides a default alerting strategy: a service sets a tag representing a bucket of service, such as CRITICAL or LOW, and is automatically monitored and alerted on. This is possible because of generic tooling and because each SLO is expressed as a ratio.
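
A sketch of what such generated output might look like, using the multiwindow, multi-burn-rate strategy from Google’s SRE Workbook. This single rule covers the fastest-burning case for a 99% SLO over a 30-day window (a burn rate of 14.4 spends 2% of the budget in one hour); slower burn rates over longer windows follow the same pattern. The metric names come from the examples above; everything else is illustrative.

groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page when the error rate is burning the 30-day budget 14.4x faster than
      # the SLO allows, measured over both a long (1h) and a short (5m) window
      # so the alert resets quickly once the problem is fixed.
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
                  / sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.99))
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
                  / sum(rate(http_requests_total[5m]))
          ) > (14.4 * (1 - 0.99))
        labels:
          severity: critical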

Conclusion

SLO alerts enable teams to monitor their commitments to their customers. When an SLO alert triggers, responders know that clients are being affected and can quantify the impact. SLO alerts are not trivial to implement, but they are great candidates for automation: the alerting logic can be implemented once and reused, rather than reimplemented for every service.
