How much time does it take to allocate a byte array? Let’s write a JMH benchmark:
@State(Scope.Thread)
public class AllocationBenchmark {
  @Param({"1000", "10000", "100000"})
  int size;

  @Benchmark
  public byte[] newByteArray() {
    byte[] bytes = new byte[size];
    bytes[bytes.length - 1] = 'a';
    return bytes; // return the array so JMH can’t dead-code eliminate the allocation
  }
}
Running this on my laptop shows that time scales with size:
size      ns/new   ns/100 K
1,000         83      8,300
10,000       546      5,460
100,000    4,659      4,659
Doing napkin math, I estimate that ‘new’ costs ~37 ns and each 100,000 bytes costs ~4,622 ns. Where does this time go?
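The napkin math is a two-point linear fit through the smallest and largest measurements in the table above; a quick sketch of the arithmetic:

```java
public class NapkinMath {
  public static void main(String[] args) {
    // Two data points from the benchmark table above.
    double ns1k = 83.0;     // 1,000-byte array
    double ns100k = 4659.0; // 100,000-byte array

    // Slope is the incremental cost per byte;
    // the intercept is the fixed cost of ‘new’ itself.
    double perByte = (ns100k - ns1k) / (100_000 - 1_000);
    double perNew = ns1k - perByte * 1_000;

    System.out.printf("new: ~%.0f ns, per 100,000 bytes: ~%.0f ns%n",
        perNew, perByte * 100_000);
  }
}
```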
- Memory management: object allocation and garbage collection.
- Zero fill: When you allocate memory on the JVM, it’s delivered fresh and new, filled with zeros. Filling reclaimed memory with zeros takes time!
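The zero-fill guarantee is easy to observe directly; a minimal check:

```java
public class ZeroFill {
  public static void main(String[] args) {
    // The JVM hands every new array back filled with zeros,
    // even if the underlying memory was previously used.
    byte[] bytes = new byte[8192];
    for (byte b : bytes) {
      if (b != 0) throw new AssertionError("expected zero fill");
    }
    System.out.println("all " + bytes.length + " bytes are zero");
  }
}
```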
Zero Zero Fill
Okio’s main feature is its Buffer class. It implements a dynamically-sized sequence of bytes that you can write to and read from. Coding with Buffer feels fast & lightweight, just like ArrayList.
Okio’s Buffer is implemented using a linked list of segments, each of which has an 8 KiB byte array. When you create a buffer it has 0 segments and when you write it adds segments to the end. When you read it removes segments from the front.
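Okio’s real Segment and Buffer classes carry more state (sharing flags, a pool, owner tracking), but a stripped-down sketch of the structure described above might look like this — the class names here are mine, not Okio’s:

```java
import java.util.ArrayDeque;

// Simplified sketch of a segmented buffer: a linked sequence of
// fixed-size byte arrays, written at the tail and read at the head.
final class Segment {
  static final int SIZE = 8 * 1024; // 8 KiB per segment
  final byte[] data = new byte[SIZE];
  int pos;   // first readable byte
  int limit; // first writable byte
}

final class SketchBuffer {
  private final ArrayDeque<Segment> segments = new ArrayDeque<>();

  void writeByte(byte b) {
    Segment tail = segments.peekLast();
    if (tail == null || tail.limit == Segment.SIZE) {
      tail = new Segment(); // a real buffer would take this from a pool
      segments.addLast(tail);
    }
    tail.data[tail.limit++] = b;
  }

  byte readByte() {
    Segment head = segments.peekFirst();
    if (head == null) throw new IllegalStateException("buffer is empty");
    byte b = head.data[head.pos++];
    if (head.pos == head.limit) segments.removeFirst(); // segment exhausted
    return b;
  }
}
```

Writing appends segments at the end and reading consumes them from the front, matching the description above.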
To make its buffers faster Okio reuses segments. This saves both the allocation cost and also the zero-fill cost. But it adds a pooling cost: getting, putting, and limiting the pool’s memory use.
Lock Free
Through Okio 2.6, Okio used synchronized to guard access to the pool. I like synchronized because the syntax is nice, it’s easy to understand, and it performs fine.
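A lock-guarded pool in the style Okio used through 2.6 can be sketched like this — the names and the deque are mine; Okio’s real pool links segments directly and caps its total byte count:

```java
import java.util.ArrayDeque;

// Sketch of a synchronized segment pool: take() reuses a segment if
// one is available, recycle() returns it up to a fixed capacity.
final class SynchronizedPool {
  private static final int MAX_SEGMENTS = 8; // 8 x 8 KiB = 64 KiB cap
  private final ArrayDeque<byte[]> free = new ArrayDeque<>();

  synchronized byte[] take() {
    byte[] segment = free.pollFirst();
    return segment != null ? segment : new byte[8 * 1024];
  }

  synchronized void recycle(byte[] segment) {
    if (free.size() < MAX_SEGMENTS) free.addFirst(segment);
    // else drop it and let the GC reclaim it
  }
}
```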
But synchronized is bad when there’s very high contention. Lots of threads fighting for the same lock means lots of context switching. In Okio 2.7 we use AtomicReference instead. The code’s bigger and less obvious, but it runs better when lots of threads use buffers at the same time.
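The usual shape for this is a Treiber stack: each operation reads the head, prepares its change, and retries if a compareAndSet loses a race. This is a minimal sketch of the technique, not Okio’s actual code — the real pool also tracks the pooled byte count, and a production stack must consider the ABA problem:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a lock-free pool as a Treiber stack of segment nodes.
final class LockFreePool {
  static final class Node {
    final byte[] data = new byte[8 * 1024];
    Node next;
  }

  private final AtomicReference<Node> head = new AtomicReference<>();

  Node take() {
    while (true) {
      Node first = head.get();
      if (first == null) return new Node(); // pool empty: allocate fresh
      if (head.compareAndSet(first, first.next)) return first;
      // CAS failed: another thread raced us; retry.
    }
  }

  void recycle(Node node) {
    while (true) {
      Node first = head.get();
      node.next = first;
      if (head.compareAndSet(first, node)) return;
    }
  }
}
```

Contended threads spin briefly and retry instead of parking, which avoids the context switches that make synchronized expensive here.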
Striped
Striping is the concurrency world’s sharding. Take a highly-contended resource, create several instances of it, and distribute its users evenly over the instances.
We scaled up Okio’s segment pool by the number of processors on the host computer. My laptop has 8 physical cores and each has 2 hyperthreads, so it gets 16 segment pools. The pools have 1 MiB of combined capacity.
One potential benefit of striping is an increased probability that a CPU core gets its previous segments back when it fetches from the pool. As implemented, all segments are interchangeable, but recently-accessed memory is more likely to be in a CPU-adjacent cache.
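Striping can be sketched by keeping an array of slots and picking one from the calling thread’s id. This is a simplification with names of my own: each of Okio’s stripes is a full lock-free stack rather than the single slot shown here.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a striped pool: a power-of-two number of slots, with the
// stripe chosen by masking the current thread's id. Different threads
// tend to land on different stripes, spreading out contention.
final class StripedPool {
  private final AtomicReference<byte[]>[] stripes;

  @SuppressWarnings("unchecked")
  StripedPool() {
    int processors = Runtime.getRuntime().availableProcessors();
    int count = Integer.highestOneBit(processors); // round down to a power of two
    stripes = new AtomicReference[count];
    for (int i = 0; i < count; i++) stripes[i] = new AtomicReference<>();
  }

  private AtomicReference<byte[]> stripe() {
    return stripes[(int) (Thread.currentThread().getId() & (stripes.length - 1))];
  }

  byte[] take() {
    byte[] segment = stripe().getAndSet(null);
    return segment != null ? segment : new byte[8 * 1024];
  }

  void recycle(byte[] segment) {
    stripe().set(segment); // may overwrite (and drop) an older segment
  }
}
```

A thread that recycles a segment and takes again soon after is likely to hit the same stripe, which is where the cache-affinity benefit comes from.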
Comparison
This benchmark is designed to create extremely high contention:
pool                size          speedup
none                0 KiB         1.0x
synchronized        64 KiB        1.3x
lockfree            64 KiB        2.0x
striped lockfree    64 KiB/core   5.8x
Under very high contention the synchronized pool is barely useful! And the striped pool is surprisingly great. I confirmed that the extra size wasn’t the difference here: the lockfree pool isn’t faster with a larger pool.
Upgrade Your Okio
These performance improvements are yours in Okio 2.7.0. Make sure to upgrade.