Large data deployments often contain multiple levels of derived tables. Understanding lineage is critical for resolving data downtime and bugs. Table hierarchies often grow out of control over time, which makes maintaining documentation and debugging data issues time-intensive and costly. This post describes a method for deriving data lineage by inspecting SQL statements. The database provides a centralized location where many different teams and ETL jobs come together, which makes it a great candidate for generating metadata like lineage.
Database deployments often combine tables to create new tables. Lineage describes how a table is derived, i.e. which tables it…
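As a rough illustration of deriving lineage from a SQL statement, the sketch below pulls the target table and its source tables out of a single statement. It uses naive regular expressions purely for illustration, which is an assumption and not necessarily the method the post describes; a production approach would more likely rely on a real SQL parser. The function name lineage and the example table names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Matches the table being written: CREATE [OR REPLACE] TABLE x / INSERT INTO x.
	targetRe = regexp.MustCompile(`(?i)\b(?:create\s+(?:or\s+replace\s+)?table|insert\s+into)\s+([\w.]+)`)
	// Matches tables being read: FROM x / JOIN x. Subqueries, CTEs, and
	// quoted identifiers are deliberately not handled by this naive pattern.
	sourceRe = regexp.MustCompile(`(?i)\b(?:from|join)\s+([\w.]+)`)
)

// lineage returns the table a statement writes to and the distinct tables it
// reads from, in order of first appearance.
func lineage(sql string) (target string, sources []string) {
	if m := targetRe.FindStringSubmatch(sql); m != nil {
		target = m[1]
	}
	seen := map[string]bool{}
	for _, m := range sourceRe.FindAllStringSubmatch(sql, -1) {
		if name := m[1]; !seen[name] {
			seen[name] = true
			sources = append(sources, name)
		}
	}
	return target, sources
}

func main() {
	// Hypothetical statement; the table names are made up for the example.
	stmt := "CREATE TABLE daily_orders AS SELECT * FROM orders o JOIN customers c ON o.cid = c.id"
	target, sources := lineage(stmt)
	fmt.Println("target:", target)
	fmt.Println("sources:", sources)
}
```

Running this over every statement an ETL job executes would yield edges (source to target) that can be assembled into a lineage graph.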
TL;DR I hit the 1024 file descriptor limit in Lambda. AWS has a bug handling this case, in which the END and REPORT logs are omitted. The fix required bounding file descriptor usage by configuring the Go HTTP client (*http.Client).
I recently worked on a project to extend application log retention and reduce costs. Part of the solution involved aggregating raw application log files from S3. These log files range in size from 100KB to ~1MB and number ~30,000 files per hour. A Go program was created to aggregate the raw log files into 32MB files. The program…
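A minimal sketch of bounding file descriptor usage through the Go HTTP client follows. It assumes the descriptors are consumed by outbound connections; the function name newBoundedClient and the specific limits and timeouts are illustrative assumptions, not the values used in the post.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newBoundedClient returns an *http.Client whose transport caps how many
// connections it will hold open per host. Each connection is a socket, and
// each socket consumes one file descriptor.
func newBoundedClient(maxConns int) *http.Client {
	transport := &http.Transport{
		// MaxConnsPerHost limits dialing, active, and idle connections per
		// host; requests beyond the limit block until a connection frees up.
		MaxConnsPerHost:     maxConns,
		MaxIdleConns:        maxConns,
		MaxIdleConnsPerHost: maxConns,
		// Close idle connections so their descriptors are released.
		IdleConnTimeout: 30 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   60 * time.Second, // assumed overall request timeout
	}
}

func main() {
	// 100 is an arbitrary illustrative cap, well under the 1024 limit.
	client := newBoundedClient(100)
	t := client.Transport.(*http.Transport)
	fmt.Println("max conns per host:", t.MaxConnsPerHost)
}
```

Response bodies still need to be read and closed by the caller; otherwise connections are not returned to the pool and the cap turns into blocked requests.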
Views are an important tool in legacy warehouse environments. Views act as an interface which decouples clients (like Tableau, Looker, applications, etc.) from underlying data sources. Enabling backwards-compatible changes is especially important in legacy environments, where the full scope of clients may be unknown. This article explains how views can be used to enable backwards-compatible data migrations, which avoid the need to update any client queries.
Legacy warehouse environments at mid-size or large companies may have tens or even hundreds of distinct clients. Clients can be:
Snowflake does not currently support subquery pruning, which can have serious cost implications for DBT incremental updates against external tables. Care must be taken when using DBT incremental models to query large partitioned external tables. This post shows how to address the incremental update by breaking the incremental query out of a subquery.
DBT incremental models load data gradually. Each run focuses on a limited (i.e. incremental) dataset, as opposed to the full dataset. This requires using a query predicate to limit the dataset. This predicate is often based on event time. …
Probing is a technique for performing regular checks on a service at a short interval. Probes provide signals that can significantly cut down debug time. This post describes probes and how they can be used to drill down into errors and make debugging more focused by partitioning the debug space.
Probes are targeted checks, performed as request / response actions, on a short (~1 minute or less) interval. Some common applications of probes are:
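To make the request / response shape of a probe concrete, here is a minimal sketch. The names Result, runOnce, and runEvery are hypothetical, and the check is passed in as a plain function so that any request (HTTP, database, queue) could sit behind it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result records the outcome of a single probe execution.
type Result struct {
	Name    string
	OK      bool
	Latency time.Duration
	Err     error
}

// runOnce performs one request/response check and times it.
func runOnce(name string, check func() error) Result {
	start := time.Now()
	err := check()
	return Result{Name: name, OK: err == nil, Latency: time.Since(start), Err: err}
}

// runEvery runs the check n times, waiting one interval between runs, and
// returns every result. A long-lived prober would loop indefinitely and emit
// each result as a metric instead of collecting them.
func runEvery(name string, interval time.Duration, n int, check func() error) []Result {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	results := make([]Result, 0, n)
	for i := 0; i < n; i++ {
		results = append(results, runOnce(name, check))
		if i < n-1 {
			<-ticker.C
		}
	}
	return results
}

func main() {
	// Stand-in check that always fails; a real probe would issue a request
	// against the service and validate the response.
	failing := func() error { return errors.New("unexpected status") }
	for _, r := range runEvery("checkout-api", 10*time.Millisecond, 3, failing) {
		fmt.Printf("%s ok=%t err=%v\n", r.Name, r.OK, r.Err)
	}
}
```

Because each probe targets one specific action, a failing probe narrows the search to the components behind that action.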
Successfully changing systems requires an understanding of the current system’s state. Profiling is a tool for understanding systems at a point in time. Without a good understanding of the current state, changes can be suboptimal, counterproductive, or even dangerous. Profiling is used to break down a system’s current state using dimensions, and is a prerequisite for successfully modifying systems.
Profiling describes the current state of a system. Knowing the current state helps to inform changes. Consider the following goals and how profiling helps each.
JSON is the de facto logging standard. JSON is so ubiquitous that popular logging data tools (such as Elasticsearch) accept JSON by default. Although JSON is an evolution over previous logging standards, JSON’s lack of strict types makes it insufficient for long-term persistence or as a foundation for a data lake. This post describes the problem with JSON and proposes a solution using a strictly typed interchange format such as Protocol Buffers.
JSON logs establish an explicit structure. JSON parsers are available in most languages, which makes it accessible as a log standard. JSON logs are referred to as…
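The lack of strict types can be seen in a small sketch. The field name user_id and the helper looseField are hypothetical; the behavior shown is that of Go's encoding/json when decoding into an untyped map, where every JSON number becomes a float64 and nothing stops the same field from being a string in the next record.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// looseField decodes a JSON log line without a schema and returns the raw
// decoded value stored under key.
func looseField(raw, key string) (interface{}, error) {
	var record map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &record); err != nil {
		return nil, err
	}
	return record[key], nil
}

func main() {
	lines := []string{
		`{"user_id": 9007199254740993}`, // integer larger than 2^53
		`{"user_id": "42"}`,             // same field, different type
	}
	for _, raw := range lines {
		v, err := looseField(raw, "user_id")
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		// The first value is decoded as float64 and silently loses
		// precision; the second is decoded as a string.
		fmt.Printf("%T %v\n", v, v)
	}
}
```

A strictly typed format would reject the second record, or at least force the producer to declare up front which type the field carries.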
Alerting on SLOs is an SRE practice which enables teams to be notified proactively when a level of service is not being met. When an SLO alert fires, teams can be confident that a client is impacted. Alternative alerting techniques have difficulty quantifying customer impact, which can complicate incident response. This post describes SLO alerts and the benefits they provide over alternatives. The Google SRE book describes how to alert on SLOs, and this post aims to describe why to alert on SLOs.
SLOs are a quantifiable target representing a client’s experience. SLOs are built on SLIs, and SLIs are…
Since February I’ve been working on an asynchronous AWS Lambda service processing 60,000 events / second from Kinesis. Lambda provides minimal operational overhead, fast deploys, predictable pricing, and enforces many tenets of 12-factor apps, out of the box. After working with Lambda almost daily for the past 3 months, I’m convinced it’s the future for asynchronous processing, and here’s why.
Lambda removes complexity over stateful deployment models. This minimizes the surface area service engineers are responsible for. Lambda removes the need for system alerts, instance sizing, autoscaling, provisioning servers, and choosing process monitors, amongst others. There’s no SSH’ing into instances…
AWS Lambda is a serverless solution which enables engineers to deploy single functions. AWS Lambda handles orchestrating, executing, and scaling the function invocations. It’s important to structure Go Lambda projects so that the lambda is a simple entry point into the application, equivalent to cmd/. After a project is structured, it is important to keep logic outside the lambda, which allows for easy reuse and testing of the application logic. The following are a series of steps which can be used in Go-based Lambda projects to help keep projects structured and increase their testability.
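One way to picture the separation is the sketch below. The Event type, CountNonEmpty, and Handler are hypothetical names, and everything lives in a single file only so the example runs on its own; in a real project the logic would sit in its own package and the entry point under cmd/.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Event is a hypothetical input payload for the function.
type Event struct {
	Lines []string `json:"lines"`
}

// CountNonEmpty is the application logic. It knows nothing about Lambda, so
// it can be reused by other entry points and unit tested directly.
func CountNonEmpty(lines []string) int {
	n := 0
	for _, line := range lines {
		if strings.TrimSpace(line) != "" {
			n++
		}
	}
	return n
}

// Handler is the thin entry point: it unpacks the event, delegates to the
// application logic, and returns the result.
func Handler(ctx context.Context, e Event) (int, error) {
	return CountNonEmpty(e.Lines), nil
}

func main() {
	// In a deployed function this would be lambda.Start(Handler), using the
	// github.com/aws/aws-lambda-go/lambda package. Handler is called directly
	// here so the sketch runs without that dependency.
	n, err := Handler(context.Background(), Event{Lines: []string{"a", "", "b"}})
	fmt.Println(n, err)
}
```

With this shape, tests exercise CountNonEmpty without constructing Lambda events or contexts, and the handler stays small enough to review at a glance.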