15 months of 24x7 Primary On-Call — Here’s How I Survived


  • Airflow: “cron” solution
  • Redshift: AWS Hosted Data Warehouse
  • DWH-01: Our EC2 “pet” server

Actionable Metrics

  • Low priority alerts → Slack
  • High priority alerts → Slack & PagerDuty
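
The routing rule above can be written as a small lookup table. This is only a sketch: `Priority`, `ROUTES`, and `destinations` are hypothetical names, and the channel strings stand in for whatever Slack and PagerDuty integrations are actually in use.

```python
from enum import Enum
from typing import Dict, List


class Priority(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical channel names; real integrations would replace these strings.
ROUTES: Dict[Priority, List[str]] = {
    Priority.LOW: ["slack"],                # visible, but nobody gets paged
    Priority.HIGH: ["slack", "pagerduty"],  # visible and pages the on-call
}


def destinations(priority: Priority) -> List[str]:
    """Return a copy of the channels an alert of this priority is sent to."""
    return list(ROUTES[priority])
```

Keeping the mapping in one place makes it easy to see at a glance which alerts are allowed to wake someone up.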

Symptoms Not Causes

  • Error rates — Executions are encountering anomalous error rates.
  • Latency Distributions — Requests are encountering anomalous latencies.
  • Request / Execution Count Success Ratio — The system is not performing work, or is erroring while doing it.
  • Queue depth — The system is unable to keep up with requested work.

Ratios Rule — But Be Careful

Emulate The Customer Experience: Probes Probes Probes Probes Probes ∞

  • Make requests to the system in the same manner any other client would (e.g. an HTTP request to an HTTP server, a SQL query to a database, etc.).
  • Emit a metric recording the result (success or failure).
  • Redshift: Every minute, a probe connects to a Redshift cluster and ensures we can execute a SQL query.
  • Airflow: Every minute, a probe invokes a job on Airflow and ensures that it completes successfully.
  • Freshness: Every hour, we check the latest record in our critical database tables (MAX(created_at)) and ensure that it is within some threshold (~24 hours).
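
A minimal sketch of the probe pattern described above, not the original implementation. `run_probe`, `ProbeResult`, and `is_fresh` are hypothetical names. The `check` callable stands in for the real client request (a SQL query against Redshift, or triggering an Airflow job), and emitting the metric is left to the caller.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    success: bool
    duration_seconds: float


def run_probe(name: str, check: Callable[[], None]) -> ProbeResult:
    """Run one probe; any exception raised by `check` counts as a failure."""
    start = time.monotonic()
    try:
        check()  # behave like a real client would
        success = True
    except Exception:
        success = False
    return ProbeResult(name, success, time.monotonic() - start)


def is_fresh(
    latest_created_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed threshold (~24 hours)
) -> bool:
    """Freshness check: the newest record must be no older than `max_age`."""
    return now - latest_created_at <= max_age
```

A scheduler would call `run_probe` every minute and `is_fresh` every hour, then publish each result as a success or failure metric. For example, `run_probe("redshift", lambda: None)` returns a result with `success=True`, while a `check` that raises returns `success=False`.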

Give Yourself Room to Fail — SLO Based Alerts

  • A success ratio (successful events / total events)
  • A target expressed as a percentage (the percentage of successful events required in an interval to be considered healthy)
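
The two ingredients above combine into a simple check. This is a sketch under stated assumptions: `success_ratio` and `slo_violated` are hypothetical names, and treating an interval with no events as healthy is an assumption, not something the list specifies.

```python
def success_ratio(successful_events: int, total_events: int) -> float:
    """Successful events divided by total events for one interval."""
    if total_events == 0:
        # Assumption: no traffic means nothing failed, so report healthy.
        return 1.0
    return successful_events / total_events


def slo_violated(
    successful_events: int, total_events: int, target_percent: float
) -> bool:
    """True when the interval's success percentage falls below the target."""
    return success_ratio(successful_events, total_events) * 100 < target_percent
```

With a 99% target, an interval with 995 successes out of 1000 events is healthy, while 980 out of 1000 is a violation. The gap between the target and 100% is the room to fail: some errors are tolerated before anyone is alerted.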





