PUBLIC OBJECT

Preparing for Network Failures this Holiday Season

Suppose I’m connecting my home Christmas lights to the Internet. Perhaps I’ll make a mobile app that calls my home control server via an HTTP API:

POST /lights/toggle HTTP/1.1

{
  "subjects": ["maple_tree", "roof"]
}

HTTP/1.1 200 OK

{
  "toggled": true
}

It works. I can finally live the dad dream of flashing the lights on and off when my sports team scores.

But this API? It sucks. If the network call fails, I won’t know what state the lights end up in. I might inadvertently leave the lights off and deprive my neighbours of holiday cheer.

Mitigation 1: Check The Network First

One tempting mitigation is to check my phone’s connectivity before calling the toggle API. Perhaps I’ll use Android’s ConnectivityManager to only toggle the lights once I’ve confirmed that the device is connected to the Internet.

But what does it mean to be connected to the Internet? Not very much! On Android phones it probably means that a recent request to https://google.com/ resulted in a successful response.

  • It doesn’t mean that a new request to https://google.com/ will succeed. I could have since stepped into a radio-blocking elevator and failed the request at the client.
  • It doesn’t mean that a new request will reach the server. Each network call involves a sequence of fallible ISPs, DNS servers, gateways, and routers. If any of these are offline or overloaded the call fails on the network.
  • It doesn’t mean that the server will successfully toggle the lights. My server code might have a bug that toggles the smart locks instead! Or perhaps it’s busy transcoding Bluey and will time out before anything is returned.

Even a live TCP connection to a server is precarious: the client, network, and server can all fail without warning. Any strategy that checks connectivity first is fragile.

Mitigation 2: Idempotence Token

From our vantage point on the client, we can’t differentiate between:

  • The toggle request didn’t reach the server
  • The server’s response didn’t reach the client

But if the server can discard duplicate calls, the client can retry until it receives a positive response.

POST /lights/toggle HTTP/1.1

{
  "request_id": "a7240db9800efaa68251f797094a208d",
  "subjects": ["maple_tree", "roof"]
}

This strategy is excellent and I recommend it everywhere. Here’s the recipe I follow:

  1. The client generates a universally-unique ID, such as UUID.randomUUID().
  2. The client includes that ID in every attempt. (If the request is stored in an on-device queue, the generated ID should be stored with it!)
  3. The server maintains a unique index of request IDs. Keeping these forever is easy and good enough for most applications.
  4. The server only performs the operation the first time it sees its request ID.
  5. The server returns the original response if a repeated request ID is received.
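
Here's a minimal sketch of that recipe in plain Java. Everything in it is a stand-in I made up for illustration: LightServer is an in-memory substitute for the home control server, FlakyTransport pretends to be a network that loses the first response after the server has already acted, and LightClient.toggleWithRetries is a hypothetical helper. A real server would keep the request ID index in a database with a unique constraint rather than in a map.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the home control server. */
class LightServer {
  private final Map<String, Boolean> lights = new ConcurrentHashMap<>();
  // Step 3: a unique index of request IDs, each mapped to its original response.
  private final Map<String, String> responses = new ConcurrentHashMap<>();

  String toggle(String requestId, List<String> subjects) {
    // Steps 4 and 5: only a new ID runs the operation; a repeated ID gets the stored response.
    return responses.computeIfAbsent(requestId, id -> {
      for (String subject : subjects) {
        lights.merge(subject, true, (old, ignored) -> !old); // absent means off, so it turns on
      }
      return "{\"toggled\": true}";
    });
  }

  boolean isOn(String subject) {
    return lights.getOrDefault(subject, false);
  }
}

/** Hypothetical transport that delivers the request but loses the first response. */
class FlakyTransport {
  private final LightServer server;
  private boolean dropNextResponse = true;

  FlakyTransport(LightServer server) {
    this.server = server;
  }

  String post(String requestId, List<String> subjects) throws IOException {
    String response = server.toggle(requestId, subjects);
    if (dropNextResponse) {
      dropNextResponse = false;
      throw new IOException("response lost on the way back");
    }
    return response;
  }
}

class LightClient {
  static String toggleWithRetries(FlakyTransport transport, List<String> subjects, int maxAttempts) {
    // Step 1: generate the ID once, outside the retry loop.
    String requestId = UUID.randomUUID().toString();
    IOException last = new IOException("no attempts made");
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Step 2: every attempt carries the same ID.
        return transport.post(requestId, subjects);
      } catch (IOException e) {
        last = e;
      }
    }
    throw new UncheckedIOException(last);
  }
}

class IdempotenceTokenDemo {
  public static void main(String[] args) {
    LightServer server = new LightServer();
    FlakyTransport transport = new FlakyTransport(server);
    String response = LightClient.toggleWithRetries(transport, List.of("maple_tree", "roof"), 3);
    System.out.println(response + " roof on: " + server.isOn("roof"));
  }
}
```

The first attempt toggles the lights and then fails on the client. The second attempt reuses the request ID, so the server replays the stored response without toggling again. A real client would probably also want a delay between attempts, which this sketch leaves out.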

Mitigation 3: Idempotent Endpoint

We don’t need an idempotence token if we can make an idempotent API. ‘Turn on the lights’ is idempotent: doing it 5 times is the same as doing it once.

We can make our API idempotent:

POST /lights/set HTTP/1.1

{
  "on": true,
  "subjects": ["maple_tree", "roof"]
}
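
To see the difference, here's a small sketch that delivers each call twice, as a retry might. SetLightsServer is a hypothetical in-memory model of the lights, not the real server.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the lights, for illustration only. */
class SetLightsServer {
  private final Map<String, Boolean> lights = new HashMap<>();

  /** Idempotent: the request names the desired end state. */
  void set(boolean on, List<String> subjects) {
    for (String subject : subjects) {
      lights.put(subject, on);
    }
  }

  /** Not idempotent: the outcome depends on how many times the call arrives. */
  void toggle(List<String> subjects) {
    for (String subject : subjects) {
      lights.put(subject, !isOn(subject));
    }
  }

  boolean isOn(String subject) {
    return lights.getOrDefault(subject, false);
  }
}

class SetLightsDemo {
  public static void main(String[] args) {
    SetLightsServer server = new SetLightsServer();
    List<String> subjects = List.of("maple_tree", "roof");

    // A retried toggle arrives twice, and the second call undoes the first.
    server.toggle(subjects);
    server.toggle(subjects);
    System.out.println("after duplicate toggle: " + server.isOn("roof"));

    // A retried set arrives twice, and the lights end up on either way.
    server.set(true, subjects);
    server.set(true, subjects);
    System.out.println("after duplicate set: " + server.isOn("roof"));
  }
}
```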

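Version numbers offer another way to make other APIs idempotent. The idea is that the server applies the request only if its current version matches the one the client requires, so a repeated request is refused instead of toggling the lights a second time: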

POST /lights/toggle HTTP/1.1

{
  "require_version": 99,
  "subjects": ["maple_tree", "roof"]
}

HTTP/1.1 200 OK

{
  "toggled": true,
  "result_version": 100
}
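
Here's a rough sketch of how a server might enforce require_version. VersionedLightServer and ToggleResult are hypothetical names, and how a stale request gets reported is my assumption: the exchange above only shows the successful case, so this sketch returns toggled = false along with the current version.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical response type mirroring the "toggled" and "result_version" fields. */
final class ToggleResult {
  final boolean toggled;
  final long resultVersion;

  ToggleResult(boolean toggled, long resultVersion) {
    this.toggled = toggled;
    this.resultVersion = resultVersion;
  }
}

/** Hypothetical in-memory server that bumps its version on every successful toggle. */
class VersionedLightServer {
  private final Map<String, Boolean> lights = new HashMap<>();
  private long version;

  VersionedLightServer(long initialVersion) {
    this.version = initialVersion;
  }

  synchronized ToggleResult toggle(long requireVersion, List<String> subjects) {
    if (requireVersion != version) {
      // Assumption: a stale or repeated request changes nothing and reports the current version.
      return new ToggleResult(false, version);
    }
    for (String subject : subjects) {
      lights.merge(subject, true, (old, ignored) -> !old); // absent means off, so it turns on
    }
    version++;
    return new ToggleResult(true, version);
  }

  synchronized boolean isOn(String subject) {
    return lights.getOrDefault(subject, false);
  }
}

class VersionedToggleDemo {
  public static void main(String[] args) {
    VersionedLightServer server = new VersionedLightServer(99);
    List<String> subjects = List.of("maple_tree", "roof");

    ToggleResult first = server.toggle(99, subjects);
    // The retry still asks for version 99, which no longer matches.
    ToggleResult retry = server.toggle(99, subjects);

    System.out.println("first toggled: " + first.toggled + ", version " + first.resultVersion);
    System.out.println("retry toggled: " + retry.toggled + ", version " + retry.resultVersion);
    System.out.println("roof on: " + server.isOn("roof"));
  }
}
```

A refused retry doesn't tell the client who moved the version forward. It could have been the client's own first attempt or somebody else's phone. The client would need to fetch the current state to decide what to do next.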

OkHttp Does Retries

This topic is dear to me ’cause OkHttp automatically retries when a pooled TCP connection cannot be reused. This is scary behaviour! I don’t want anybody to accidentally buy two Christmas trees ’cause an HTTP request was repeated.

Be idempotent
Skip connectivity checks
Just one Christmas Tree