My son’s swim meet is coming up. If he’s healthy, he’ll compete.
My daughter’s hockey team has a game coming up. League rules say they must forfeit if they have fewer than 10 players. There are only 10 girls on the roster, so they need everyone to be healthy to compete.
In either situation, a single failure will fail the lot. With swimming there’s just one chance to get unlucky, whereas with hockey there are ten. I’ve been thinking of these ‘opportunities to get unlucky’ as bad luck tickets.
I see bad luck tickets everywhere.
When I was working on smaller teams, the leadership was very open and transparent with their plans and decisions. On larger teams, internal news often leaks externally! Share your plans with 1,000 people and you’ve got 1,000 chances to get unlucky.
I was impressed – but not surprised – by this year’s xz supply chain attack. If our software depends on 10 open source libraries, that’s 10 chances to be compromised by a rogue maintainer. Depend on 1,000 libraries and we’ve got 1,000 bad luck tickets.
One of the backend services at my job does fan-out queries on its sharded database. Suppose you’d like to know how much money it transferred yesterday:
SELECT SUM(amount_cents)
FROM transactions
WHERE DATE(sent_at) = DATE('2024-11-11');
On a database with 64 shards, the query is executed on each shard and then aggregated. Each shard has an opportunity to fail.
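Here’s a rough sketch of that fan-out in Python. It is not our actual service: I’m assuming in-memory SQLite databases as stand-in shards, three of them instead of 64, and the helper names `make_shard` and `fan_out_sum` are made up for illustration. It runs the shards one after another to keep the example small.

```python
import sqlite3

QUERY = """
SELECT SUM(amount_cents)
FROM transactions
WHERE DATE(sent_at) = DATE('2024-11-11');
"""

def make_shard(rows):
    """Create a stand-in shard: an in-memory SQLite database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (amount_cents INTEGER, sent_at TEXT)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    return conn

def fan_out_sum(shards):
    """Run the query on every shard and add up the partial sums.

    Any exception from any shard propagates, so one bad shard
    fails the whole aggregate.
    """
    total = 0
    for shard in shards:
        (partial,) = shard.execute(QUERY).fetchone()
        total += partial or 0  # SUM() is NULL when a shard has no matching rows
    return total

shards = [
    make_shard([(1_000, "2024-11-11 08:30:00"), (250, "2024-11-10 23:59:59")]),
    make_shard([(4_200, "2024-11-11 17:05:00")]),
    make_shard([]),
]
healthy_total = fan_out_sum(shards)  # 5200

# Simulate one shard going away: the aggregate fails with it.
shards[1].close()
try:
    fan_out_sum(shards)
    aggregate_failed = False
except sqlite3.ProgrammingError:
    aggregate_failed = True
```

Two of the three shards are perfectly healthy in that last call, and it doesn’t matter. Every shard holds a bad luck ticket.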
The math is weird.
Our database executes fan-out queries in parallel. That’s rad! But the latency of the aggregate query is the maximum latency of the 64 individual shard queries. If one shard is hot because the traffic isn’t balanced, that’ll degrade all of the aggregate queries.
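To put numbers on that, here’s a small Python sketch. The latencies, the 1% slow rate, and the function names are all hypothetical, and the second function assumes shards are slow independently of one another.

```python
def fan_out_latency_ms(shard_latencies_ms):
    """A parallel fan-out finishes when its slowest shard finishes."""
    return max(shard_latencies_ms)

def fraction_of_slow_aggregates(per_shard_slow_rate, shard_count):
    """Chance that at least one shard is slow, assuming independence."""
    return 1 - (1 - per_shard_slow_rate) ** shard_count

# Hypothetical latencies: 64 balanced shards vs. 63 balanced shards plus one hot one.
balanced = [20.0] * 64
one_hot = [20.0] * 63 + [400.0]

balanced_latency = fan_out_latency_ms(balanced)  # 20.0
hot_latency = fan_out_latency_ms(one_hot)        # 400.0

# If each shard is slow on 1% of queries, about 47% of 64-shard aggregates are slow.
slow_share = fraction_of_slow_aggregates(0.01, 64)
```

Under those assumptions, a per-shard tail that looks rare becomes close to a coin flip once it’s multiplied across 64 shards.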
Suppose an action fails on 1 in 1,000,000 attempts. If we attempt it 1,000 times, there’s roughly a 0.1% chance of at least one failure within those attempts¹.
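If attempts fail independently with probability p, the chance of at least one failure in n attempts is 1 − (1 − p)^n. A minimal Python sketch of that arithmetic follows. The function name is mine, and the 5% chance of an athlete being sick is a number I made up purely for illustration.

```python
import math

def chance_of_any_failure(p_fail, attempts):
    """Probability of at least one failure in `attempts` independent tries.

    Mathematically this is 1 - (1 - p_fail) ** attempts; the log1p/expm1
    form just keeps floating-point precision when p_fail is tiny.
    """
    return -math.expm1(attempts * math.log1p(-p_fail))

# The example from the text: 1-in-a-million failure, 1,000 attempts.
rare_action = chance_of_any_failure(1e-6, 1_000)  # about 0.001, i.e. 0.1%

# Hypothetical: assume each athlete has a 5% chance of being sick on the day.
swim = chance_of_any_failure(0.05, 1)     # about 0.05: one ticket
hockey = chance_of_any_failure(0.05, 10)  # about 0.40: ten tickets
```

With the same made-up odds per kid, the swimmer misses his meet about 1 time in 20 and the hockey team forfeits about 2 times in 5. That all leans on independence, which is a generous assumption.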
Failures will probably correlate (and the successes will too). Hockey players don’t get sick independently! They exhaust themselves at the same tournaments. And they share their water bottles even though I keep telling them it’s gross and they should cut it out.
Limit your bad luck.
It’s difficult to build robust systems with fallible parts, but fallible parts is all there is. I aspire to do more with less.