AWS Lambda — Debugging File Descriptor Exhaustion Using Go

TL;DR I hit the 1024 File descriptor limit in lambda. AWS has a bug handling this case, in which END and REPORT logs were omitted. The fixed required bounding File Descriptor usage by configuring the go http client (*http.Client).

Background

I recently worked on a project to extend application log retention and reduce costs. Part of the solution involved aggregating raw application log files from S3. These log files range in size from range from 100KB → ~1MB and numbered ~30,000 files for an hour. A go program was created to aggregate the raw log files into 32MB files. The program was deployed as an AWS lambda and executed hourly. The lambda takes a “raw” s3 directory as input, and downloads all files in that directory buffers them to 32MB, and writes them out to an output directory whenever files reach 32MB:

s3://application-logs/raw/.../YYYYmmdd/HH/<- Lambda ->s3://application-logs/aggregated/.../YYYYmmdd/HH/

A Strange Occurrence

The first thing I did after seeing the failed invocations was to check the logs. A number of codepaths in the rollup script intentionally panics to trigger a lambda retry. Each lambda invocation is idempotent which simplifies the lambda code and allows for ignoring complicated error handling, state management, and edge cases. It was immediately obvious something strange was going on. Every healthy AWS Lambda invocation results in 3 log statements: START, END and REPORT which look like:

START RequestId: c793869b-ee49-115b-a5b6-4fd21e8dedac Version: $LATEST 2019-06-07T19:11:20.562Z c793869b-ee49-115b-a5b6-4fd21e8dedac...END RequestId: c793869b-ee49-115b-a5b6-4fd21e8dedacREPORT RequestId: c793869b-ee49-115b-a5b6-4fd21e8dedac  Duration: 128.83 ms Billed Duration: 200 ms Memory Size: 128 MB Max Memory Used: 74 MB  Init Duration: 166.62 ms    XRAY TraceId: 1-5d9d007f-0a8c7fd02xmpl480aed55ef0   SegmentId: 3d752xmpl1bbe37e Sampled: true

AWS Support

The next thing I did was file a ticket with AWS support. The AWS documentation states that all lambda invocations should contain the START, REPORT, and END log lines. AWS support confirmed the missing logs were unexpected and a bug on Amazon’s end.

Ensuring the Rollup Script Worked

I spent a good amount of time ensuring the correctness of the rollup lambda. The lambda is written in Go, and Go is a concurrent language. A side effect of concurrency is the ability to right concurrently incorrect code. The concurrent correctness was prioritized during development by:

  • Share memory by communicating… through concurrency safe channels
  • Auditing all dependencies to ensure that shared data is safe for concurrent use according to docs. This means finding all documentation for all concurrently accessed dependencies and verifying that they had statements like: Sessions are safe to use concurrently as long as the Session is not being modified.(AWS go sdk session documentation)
  • Use the go -race detector during development
  • Byte by byte input and output comparisons — concurrency errors would likely result in slightly off outputs from data being overwritten or ignored. Combined with the above strategies, the fact that input and output was correct under very high concurrency and soak testing (long execution times) was a strong indicator of no race conditions
  • Executing the program locally many times under high concurrency

Moving Forward

My primary focus was delivering this pipeline according to my commitment. Even though the script was erroring intermittently (a couple of times a day across all environments — ~98% reliability) I set up a dead letter queue which retried the lambda until it succeeded; ensuring that the pipeline was “done” and
the commitment was fulfilled. Even though the pipeline could process data reliably (with some retries…) it didn’t sit well that the lambda was erroring in such a strange way…

Starting to Debug

The first step I took was to remove any shared state between lambda invocations. AWS caches lambda execution environments in order to reduce lambda execution time by removing startup times. Lambdas have two types of resources:

Back to Basics

I was pretty confident in the correctness of the rollup lambda but errors were still present. Even though I had been using AWS lambda for almost 4 months, I had never actually read the official full lambda documentation from top to bottom:

Verifying Hypothesis

In order to verify the hypothesis the lambda was setup to monitor how many file descriptors it was using. I used prometheus’ procfs project to query this information from the OS. The proc filesystem is not available in MacOS so the program had to be executed in linux. This required:

  • Building the go binary for linux
  • Copying the binary to an ubuntu docker container with gatekeeper generated AWS credentials (since the program makes requests to our S3 buckets)
  • Invoke the program against production workloads
done := make(chan int)go func() {  proc, err := procfs.NewProc(os.Getpid())  if err != nil {    panic(err)  }  t := time.NewTicker(5 * time.Second)  for {    select {      case <-t.C:        fmt.Printf("numGoroutines: %d\n", runtime.NumGoroutine())        numFds, _ := proc.FileDescriptorsLen()        fmt.Printf("num fds: %d\n", numFds)      case <-done:        fmt.Println("done status")        return     }  }}()
numGoroutines: 3683num fds: 1715numGoroutines: 3021num fds: 1386numGoroutines: 2339num fds: 1035

Bounding Resource Usage

The only file descriptors that the program used were TCP connections related to S3 connections. During the initial investigation I noticed that all the *http.Clients were using the default client (S3 Downloader, S3 Uploader and Datadog go library). I updated the lambda to provide Amazon recommended defaults, with the addition of specifying a maximum number of allowed connections per host (MaxConsPerHost):

HTTPClientSettings{Connect:          5 * time.Second,ExpectContinue:   1 * time.Second,IdleConn:         90 * time.Second,ConnKeepAlive:    30 * time.Second,MaxAllIdleConns:  100,MaxHostIdleConns: 10,ResponseHeader:   5 * time.Second,TLSHandshake:     5 * time.Second,MaxConnsPerHost:  250,},
sess := session.Must(  session.NewSession(    &aws.Config{      HTTPClient: logs.NewDefaultHTTPClientWithSettings(),    },  ),),
numGoroutines: 761num fds: 256

Error Free!

The code was deployed to production during Thursday May 14th afternoon, and have not seen an error since!

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store