Data Operational Maturity

Data operations is the act of keeping data pipeline operational, and the data flowing. Data operational maturity focuses on answering:

  • Is a pipeline running?
  • Is data fresh and consistent?
  • Is the data correct?

This data operational maturity model draws heavily from Site Reliability Engineering (SRE). This data operational maturity model treats data operations as a software problem, in the same way that SRE is built on treating operations as a software problem, building observability into the very foundation of data systems.

Maturity Model

I like to think of data operational maturity composed of 3 distinct levels. Each level has tradeoffs in terms of effort of implementation, fidelity of the signals it provides, and latency of the insights that it surfaces.

Level 1 — Mechanism

  • How often are jobs executing?
  • Are jobs completing successfully?
  • How long do jobs take?
  • Are jobs queueing?
  • How are underlying resources (such as CPU, memory, and/or disk) utilized?

These metrics are so fundamental Google has an explicit name for them: The Four Golden Signals. These metrics are the cornerstone of data observability maturity in the same way that they are the cornerstone of Site Reliability Engineering at Google.

Investment and effort at this level is relatively low. Many open source and cloud solutions exist to capture these metrics ( Datadog, Prometheus, etc). Instrumenting code and pipelines at this level is generally a fixed (constant time) cost. For example, many projects, such as Airflow (a popular workflow engine, think cron for connected jobs), emit metrics natively. Every job it executes surfaces metrics about that execution. Even custom frameworks or libraries only need to be instrumented a single time.

At this level an entity can:

  • Detect when failures occur, or execution SLOs, have been breached
  • Alert on issues
  • Respond to issues
  • Use signals gathered at this level to triage issues

Level 2 — Consistency

Level 2 low to medium latency on insights. Some data properties, such as a drop in stream message rate, or an increase in NULL values, can be detected immediately. Other issues, such as daily volume trends, can only be detected at the aggregation interval (per day in this case). A Level 2 entity can reliably answer questions around the data, such as:

  • How much data is flowing through the pipelines?
  • How many rows per day per table?
  • Does daily table row volume follow historical trends?
  • How recent is materialized data (freshness)?
  • What are the dependencies for a given piece of data (such as a table)?

Data observability is a market focused on Level 2 data operational maturity, and testament to the effectiveness of this level of maturity, especially for legacy or enterprise data deployments. Monte Carlo is one vendor in this space. At this level an entity can:

Level 3 — Accuracy

Level 3 can have high latency, because events need to propagate through the entire data pipeline. Accuracy has a very high cost of development and maintenance. A poorly designed solution can provide negative ROI and end up costing the company large amounts of time tending to flaky tests. A Level 3 entity can reliably answer:

  • Are key metrics calculated correctly?
  • Are core system properties, like privacy, respected?
  • Does GDPR (or equivalent) export / forget work?

A level 3 entity can verify the calculation of metrics (such as Daily Active Users) against a specification and ensure that changes to the data systems do not violate the specification. At this level an entity can:

Conclusion

Originally published at https://on-systems.tech.