15 months of 24x7 Primary On-Call — Here’s How I Survived

Background

  • Airflow: “cron” solution
  • Redshift: AWS Hosted Data Warehouse
  • DWH-01: Our EC2 “pet” server

Actionable Metrics

  • Low priority alerts → Slack
  • High priority alerts → Slack & PagerDuty
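The routing rule above is small enough to write down as a lookup table. The following is a minimal sketch, not the author's actual setup: the `Priority` enum, the `ROUTES` table and the channel names are hypothetical, and a real integration would call the Slack and PagerDuty APIs rather than return strings.

```python
from enum import Enum
from typing import Dict, List


class Priority(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical channel names used only for illustration.
ROUTES: Dict[Priority, List[str]] = {
    Priority.LOW: ["slack"],                # visible, but nobody is woken up
    Priority.HIGH: ["slack", "pagerduty"],  # visible and pages the on-call
}


def destinations(priority: Priority) -> List[str]:
    """Return the channels an alert of the given priority is sent to."""
    return list(ROUTES[priority])
```

Keeping the mapping in one table makes it easy to review which alerts are allowed to page a human.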

Symptoms Not Causes

  • Error rates — Executions are encountering anomalous error rates.
  • Latency Distributions — Requests are encountering anomalous latencies.
  • Request / Execution Count and Success Ratio — The system is not performing work, or is erroring while doing it.
  • Queue depth — The system is unable to keep up with requested work.
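To make the first two symptoms concrete, here is a rough sketch of how an error rate and a latency percentile could be computed over a window of executions. The `Execution` record and both function names are assumptions made for illustration; in practice a metrics backend usually does this aggregation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Execution:
    """Hypothetical record of one unit of work observed in a time window."""
    succeeded: bool
    latency_seconds: float


def error_rate(executions: List[Execution]) -> float:
    """Fraction of executions that failed; 0.0 when the window is empty."""
    if not executions:
        return 0.0
    failures = sum(1 for e in executions if not e.succeeded)
    return failures / len(executions)


def latency_percentile(executions: List[Execution], pct: float) -> float:
    """Nearest-rank latency percentile, with pct in the range (0, 100]."""
    if not executions:
        raise ValueError("no executions in window")
    ordered = sorted(e.latency_seconds for e in executions)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Note that an empty window reports an error rate of 0.0, which is one reason the execution count is worth tracking next to the ratio: a system doing no work at all can otherwise look healthy.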

Ratios Rule — But Be Careful

Emulate The Customer Experience: Probes Probes Probes Probes Probes ∞

  • Make requests to the system in the same manner any other client would (e.g. an HTTP request to an HTTP server, a SQL query to a database).
  • Emit a metric recording the result: success or failure.
  • Redshift: Every minute a probe connects to a Redshift cluster and ensures we can execute a SQL query.
  • Airflow: Every minute a probe invokes a job on Airflow and ensures that it completes successfully.
  • Freshness: Every hour we check the latest record in our critical database tables (MAX(created_at)) and ensure that it is within some threshold (~24 hours).
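The probes described above share one shape: act like a client, then emit a success or failure metric. Below is a minimal sketch of that shape plus the freshness comparison. Everything named here is an assumption for illustration: `run_probe`, `is_fresh`, the `probe.result` metric name and the `emit` callback are hypothetical, and an in-memory `sqlite3` database stands in for a real Redshift connection so the example stays self-contained.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Callable, Dict

# Hypothetical metric sink: takes a metric name and a dict of tags.
Emit = Callable[[str, Dict[str, str]], None]


def run_probe(name: str, check: Callable[[], None], emit: Emit) -> bool:
    """Run one client-style check and emit its result as a metric."""
    try:
        check()
        result = "success"
    except Exception:
        # Any exception counts as a failed probe; only the outcome is reported.
        result = "failure"
    emit("probe.result", {"probe": name, "result": result})
    return result == "success"


def sql_check() -> None:
    """Connect and run a trivial query, as any SQL client would."""
    conn = sqlite3.connect(":memory:")  # stand-in for a Redshift connection
    try:
        conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()


def is_fresh(latest: datetime, now: datetime, threshold: timedelta) -> bool:
    """True when the newest record (e.g. MAX(created_at)) is within threshold."""
    return now - latest <= threshold
```

A scheduler would call `run_probe("redshift", sql_check, emit)` every minute, and the freshness probe would pass the result of the `MAX(created_at)` query to `is_fresh` with a threshold of about 24 hours.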

Give Yourself Room to Fail — SLO Based Alerts

  • A success ratio (successful events / total events)
  • A target expressed as a percentage (the percentage of successful events required in an interval to be considered healthy)
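The two ingredients above combine into a single comparison. This is a minimal sketch with hypothetical function names; treating an interval with no events as healthy is an assumption here, not something the list above specifies.

```python
def success_ratio(successful: int, total: int) -> float:
    """Successful events divided by total events for one interval."""
    if total == 0:
        return 1.0  # assumption: an interval with no events counts as healthy
    return successful / total


def slo_breached(successful: int, total: int, target_percent: float) -> bool:
    """True when the interval's success percentage is below the target."""
    if total == 0:
        return False  # same assumption as in success_ratio
    # Cross-multiplied form of (successful / total) * 100 < target_percent,
    # which avoids dividing before the comparison.
    return successful * 100 < target_percent * total
```

With a 99% target, an interval of 1,000 events tolerates up to 10 failures before `slo_breached` returns true. That tolerance is the room to fail: a handful of errors no longer pages anyone.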

Conclusion

https://stackoverflow.com/users/594589/dm03514
