Discovering Limits

All systems have limits. If I don’t design for them, that doesn’t make them disappear! It just means I’ll be unprepared when they’re reached.

Most of the systems I work on accept HTTP requests, perform operations on databases to satisfy them, and format their responses. Though ‘the cloud’ papers over many of the details, each such call is still performed by a physical machine in a physical datacenter. That machine has finite memory, compute, storage, and bandwidth. And the company has a finite budget for how many machines we can rent!

I’ve found limits in both expected places and unexpected ones.

We didn’t limit threads in our CachedThreadPool.

An HTTP request comes in and is assigned a thread; there are plenty! But that thread is starved of CPU time and the call is frustratingly slow.

This can create a death spiral: the caller might time out and try again. The retries increase the overall demand on the system, which further exacerbates the problem.
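One way to put a ceiling on this is to swap the cached pool for a ThreadPoolExecutor with a fixed thread count and a bounded queue. What follows is a minimal sketch, not what we shipped: the class name BoundedPool, the helper acceptedBeforeRejection, and the sizes 4 and 8 are all made up for illustration, and real numbers should come from load testing.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPool {
    // Hypothetical factory: at most maxThreads workers and queueCapacity waiting tasks.
    // AbortPolicy makes execute() throw once both are full, instead of growing forever.
    static ThreadPoolExecutor create(int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Submits tasks that block until released, and counts how many the pool
    // accepts before it rejects one.
    static int acceptedBeforeRejection(int maxThreads, int queueCapacity) {
        ThreadPoolExecutor pool = create(maxThreads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        int accepted = 0;
        try {
            while (true) {
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                accepted++;
            }
        } catch (RejectedExecutionException e) {
            // All threads are busy and the queue is full: the limit was reached.
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
        return accepted;
    }

    public static void main(String[] args) {
        // 4 running + 8 queued, so this prints 12
        System.out.println(acceptedBeforeRejection(4, 8));
    }
}
```

The point of the design is that the thirteenth task fails immediately and visibly, rather than being handed a thread that will never get enough CPU to finish on time.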

We used up all the available entropy.

I wasn’t aware that the entropy in /dev/random is limited. When we requested too many random numbers it ran out and SecureRandom.nextInt() started taking many seconds to complete!

We were requesting random numbers due to a surge in inbound HTTPS connections: each needs some random numbers! We fixed it by enabling unlimited entropy with the magical -Djava.security.egd=file:/dev/./urandom JVM flag. The extra dot is necessary; urandom is just as good as random.
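A stall like this is easy to miss unless something is timing the calls. Here is a rough sketch of a probe that reports the slowest SecureRandom.nextInt() call in a batch; the class name EntropyProbe, the method slowestDrawMillis, and the sample count are my own inventions for illustration. On a healthy machine I'd expect the result to be at or near zero milliseconds, though I haven't pinned down a threshold.

```java
import java.security.SecureRandom;

class EntropyProbe {
    // Returns the duration of the slowest single nextInt() call, in milliseconds.
    static long slowestDrawMillis(SecureRandom random, int samples) {
        long worstNanos = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            random.nextInt();
            worstNanos = Math.max(worstNanos, System.nanoTime() - start);
        }
        return worstNanos / 1_000_000;
    }

    public static void main(String[] args) {
        // Prints null unless the -Djava.security.egd flag was passed to the JVM.
        System.out.println("java.security.egd = " + System.getProperty("java.security.egd"));
        System.out.println("slowest draw: "
                + slowestDrawMillis(new SecureRandom(), 1_000) + " ms");
    }
}
```

Running it with and without the flag on the affected host is one way to check whether the flag actually took effect.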

An additional fix is to add Conscrypt or Envoy to terminate TLS.

We outran the garbage collector running inside our SSD.

Solid state drives have their own little CPUs inside to manage their blocks of storage. There is a lot of software running inside every SSD, maintaining the illusion of a simple block store.

Each block is like a little hotel room: it is either occupied and holds application data, or unoccupied and available to be written to. Unfortunately, like a hotel room, SSD blocks need to be cleaned (i.e., zeroed) between uses. The cleaner is fast and typically invisible.

Under sustained heavy writes we exhausted the SSD’s inventory of clean blocks and our write throughput dropped dramatically. Writes on a disk that was ‘60% free’ went from 500 MB/s to 50 MB/s. That was not good.

Netflix & Chill

Perhaps my favorite JVM library is Netflix Concurrency Limits. Install it in a thread pool or webserver and it automatically learns the component’s two critical characteristics:

The rate at which work is produced.
The rate at which the work is consumed. Exceeding the limits above causes the consumer to slow!

If ever the production rate exceeds the consumption rate, the Netflix library will fast-fail calls that should not be attempted. Rather than trying to satisfy all inbound requests and doing a bad job at all of them, it ensures that everything attempted can be completed. It turns a thrashing service into one that won’t buckle under pressure.
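To make the fast-fail idea concrete, here is a minimal sketch using only the standard library. It is not the Netflix library's API, and it does not learn anything: the limit is a fixed number I pass in, whereas the real library adjusts it automatically. The class FixedConcurrencyLimiter and the method tryCall are hypothetical names.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class FixedConcurrencyLimiter {
    private final Semaphore permits;

    // maxInFlight is a hand-picked constant here; an adaptive limiter would tune it.
    FixedConcurrencyLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Runs the work if a permit is free. Otherwise returns empty immediately,
    // without running or queueing the work.
    <T> Optional<T> tryCall(Supplier<T> work) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // fast-fail: do not attempt what cannot be completed
        }
        try {
            return Optional.of(work.get());
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        FixedConcurrencyLimiter limiter = new FixedConcurrencyLimiter(1);
        Optional<String> outer = limiter.tryCall(() -> {
            // The only permit is held by the outer call, so the nested call is rejected.
            Optional<String> inner = limiter.tryCall(() -> "inner");
            return "outer ran, inner present: " + inner.isPresent();
        });
        System.out.println(outer.get()); // prints "outer ran, inner present: false"
    }
}
```

In a webserver the empty result would map to something like an HTTP 429 or 503, so the caller finds out quickly instead of waiting on a request that was never going to finish.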