I’ve been working on a product that’s grown from 5 developers in 2013 to 400 developers today. Our product complexity and customer numbers have grown similarly.
I’m lucky and thankful to have been on a team whose hard work has paid off so much. But I’m also frustrated to watch my beloved code get stretched beyond its capacity and become a liability! Our monolithic service used to be majestic but now it’s big and slow.
I wonder what we could have done differently to better anticipate growth and to better design for it. But in exploring this I’ve realized that growth is discontinuous. I’d like to explain this with an analogy to restaurants...
From Neighbourhood Café to Starbucks
Suppose I’m going to launch a small café. I source coffee beans & baked goods from the grocery store and farmer’s market. I hire friends & acquaintances. It’s a difficult business to run because as owner/operator I need to do everything: hiring, accounting, scheduling, facilities, marketing, and of course barista-ing! But the processes are straightforward: because I’m the only one in charge, I make all the decisions and don’t need many meetings.
Business is good and I decide to launch more cafés across town. Then more across the province, and within a decade I’m thinking about opening a location across the border. This business is not like the original café. There’s a centralized facility for roasting coffee beans. There’s a comprehensive hiring process and an HR team. There’s a shift-scheduling system and an IT team to manage it. There are property managers, marketers, and barista trainers.
In the ramp-up from single-location café to international franchise, there are necessary discontinuities. When do we hire the first sysadmin? The first CTO?
From Tech Demo to Critical Infrastructure
We can make the same comparison between a hackathon app and the industry-leading ecosystem that it aspires to grow into. The big system needs stuff that the small system doesn’t.
A small system can centralize all of its data in a single SQL database. Big systems need to duplicate data across different SQL and K/V databases to accommodate different availability needs, use cases, access patterns, and permission requirements.
A small system can tolerate total failure. If you deploy bad code, the whole system can go dark and that’s okay. Big systems need to isolate failures and recover gracefully.
A small system can have unsophisticated security. Maybe admins can connect to the production database and run queries? But big systems need to defend against untrustworthy employees. They need granular permission groups, role management, audit logs, and policies to help set them.
In the early days of my product we’d occasionally need to make a technical decision that had consequences for long-term growth. Here’s a fictional but representative discussion:
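To make that last point a little more concrete, here is a minimal sketch of what "granular permissions plus an audit log" might look like in Python. Everything in it is hypothetical and for illustration only: the role names, the permission strings, and the `AuditLog` and `is_allowed` helpers are assumptions, not anything from our actual system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping. A real system would likely
# load this from reviewed policy configuration rather than hard-code it.
ROLE_PERMISSIONS = {
    "support": {"customer:read"},
    "dba": {"customer:read", "db:query"},
}


@dataclass
class AuditLog:
    """In-memory stand-in for an append-only audit trail."""
    entries: list = field(default_factory=list)

    def record(self, user, permission, allowed):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "permission": permission,
            "allowed": allowed,
        })


def is_allowed(user, roles, permission, audit):
    """Return True if any of the user's roles grants the permission.

    Every attempt is recorded, whether it is granted or denied.
    """
    allowed = any(
        permission in ROLE_PERMISSIONS.get(role, set()) for role in roles
    )
    audit.record(user, permission, allowed)
    return allowed
```

In the small system, nobody writes this code because everybody is an admin. In the big system, a check like this sits in front of every sensitive action, and the denied attempts in the log are often the most interesting entries.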
‘Each Aurora DB can store up to 128 terabytes of data. So if we store 4 megabytes per customer, we’ll be out of space at 33 million customers...’
‘Wouldn’t that be a nice problem to have!’
It turns out we were kind of right.
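For the curious, the back-of-envelope arithmetic in that exchange can be checked with a few lines of Python. This is only a sketch: the `max_customers` helper is hypothetical, and the answer depends on an assumption about units. Reading "terabytes" and "megabytes" as binary units (powers of 1024) gives roughly the 33 million quoted above, while decimal units give 32 million.

```python
TIB = 2**40  # bytes in a tebibyte (binary "terabyte")
MIB = 2**20  # bytes in a mebibyte (binary "megabyte")


def max_customers(capacity_bytes, bytes_per_customer):
    """Whole number of customers that fit in the given capacity."""
    return capacity_bytes // bytes_per_customer


# Assumption: binary units, i.e. 128 TiB of storage and 4 MiB per customer.
binary_estimate = max_customers(128 * TIB, 4 * MIB)

# Assumption: decimal units, i.e. 128 * 10^12 bytes and 4 * 10^6 bytes.
decimal_estimate = max_customers(128 * 10**12, 4 * 10**6)

print(f"binary units:  {binary_estimate:,} customers")
print(f"decimal units: {decimal_estimate:,} customers")
```

Either way the ceiling is in the tens of millions, which is why it felt so comfortably far away at the time.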