How much time does it take to allocate a byte array? Let’s write a JMH benchmark:
@State(Scope.Thread)
public class AllocationBenchmark {
  @Param({"1000", "10000", "100000"})
  int size;

  @Benchmark
  public byte[] newByteArray() {
    byte[] bytes = new byte[size];
    bytes[bytes.length - 1] = 'a';
    return bytes; // return the array so JMH can’t dead-code eliminate the allocation
  }
}
Running this on my laptop shows that time scales with size:
size      ns/new   ns/100 K
1,000         83      8,300
10,000       546      5,460
100,000    4,659      4,659
Doing napkin math, I estimate that ‘new’ costs ~37 ns and each 100,000 bytes costs ~4,622 ns. Where does this time go?
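The napkin math is a two-point linear fit through the smallest and largest measurements in the table above; a quick sketch of the arithmetic:

```java
public class NapkinMath {
  public static void main(String[] args) {
    // Two data points from the benchmark table above.
    double ns1k = 83.0;     // 1,000-byte array
    double ns100k = 4659.0; // 100,000-byte array

    // Slope is the incremental cost per byte;
    // the intercept is the fixed cost of ‘new’ itself.
    double perByte = (ns100k - ns1k) / (100_000 - 1_000);
    double perNew = ns1k - perByte * 1_000;

    System.out.printf("new: ~%.0f ns, per 100,000 bytes: ~%.0f ns%n",
        perNew, perByte * 100_000);
  }
}
```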
- Memory management: object allocation and garbage collection.
- Zero fill: When you allocate memory on the JVM, it’s delivered fresh and new, filled with zeros. Filling reclaimed memory with zeros takes time!
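The zero-fill guarantee is easy to observe directly; a minimal check:

```java
public class ZeroFill {
  public static void main(String[] args) {
    // The JVM hands every new array back filled with zeros,
    // even if the underlying memory was previously used.
    byte[] bytes = new byte[8192];
    for (byte b : bytes) {
      if (b != 0) throw new AssertionError("expected zero fill");
    }
    System.out.println("all " + bytes.length + " bytes are zero");
  }
}
```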
Zero Zero Fill
Okio’s main feature is its Buffer class. It implements a dynamically-sized sequence of bytes that you can write to and read from. Coding with Buffer feels fast & lightweight, just like ArrayList.
Okio’s Buffer is implemented using a linked list of segments, each of which has an 8 KiB byte array. When you create a buffer it has 0 segments and when you write it adds segments to the end. When you read it removes segments from the front.
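Okio’s real Segment and Buffer classes carry more state (sharing flags, a pool, owner tracking), but a stripped-down sketch of the structure described above might look like this — the class names here are mine, not Okio’s:

```java
import java.util.ArrayDeque;

// Simplified sketch of a segmented buffer: a linked sequence of
// fixed-size byte arrays, written at the tail and read at the head.
final class Segment {
  static final int SIZE = 8 * 1024; // 8 KiB per segment
  final byte[] data = new byte[SIZE];
  int pos;   // first readable byte
  int limit; // first writable byte
}

final class SketchBuffer {
  private final ArrayDeque<Segment> segments = new ArrayDeque<>();

  void writeByte(byte b) {
    Segment tail = segments.peekLast();
    if (tail == null || tail.limit == Segment.SIZE) {
      tail = new Segment(); // a real buffer would take this from a pool
      segments.addLast(tail);
    }
    tail.data[tail.limit++] = b;
  }

  byte readByte() {
    Segment head = segments.peekFirst();
    if (head == null) throw new IllegalStateException("buffer is empty");
    byte b = head.data[head.pos++];
    if (head.pos == head.limit) segments.removeFirst(); // segment exhausted
    return b;
  }
}
```

Writing appends segments at the end and reading consumes them from the front, matching the description above.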
To make its buffers faster Okio reuses segments. This saves both the allocation cost and also the zero-fill cost. But it adds a pooling cost: getting, putting, and limiting the pool’s memory use.
Lock Free
Through Okio 2.6, Okio used synchronized to guard access to the pool. I like synchronized because the syntax is nice, it’s easy to understand, and it performs fine.
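A lock-guarded pool in the style Okio used through 2.6 can be sketched like this — the names and the deque are mine; Okio’s real pool links segments directly and caps its total byte count:

```java
import java.util.ArrayDeque;

// Sketch of a synchronized segment pool: take() reuses a segment if
// one is available, recycle() returns it up to a fixed capacity.
final class SynchronizedPool {
  private static final int MAX_SEGMENTS = 8; // 8 x 8 KiB = 64 KiB cap
  private final ArrayDeque<byte[]> free = new ArrayDeque<>();

  synchronized byte[] take() {
    byte[] segment = free.pollFirst();
    return segment != null ? segment : new byte[8 * 1024];
  }

  synchronized void recycle(byte[] segment) {
    if (free.size() < MAX_SEGMENTS) free.addFirst(segment);
    // else drop it and let the GC reclaim it
  }
}
```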
But synchronized is bad when there’s very high contention. Lots of threads fighting for the same lock means lots of context switching. In Okio 2.7 we use AtomicReference instead. The code’s bigger and less obvious, but it runs better when lots of threads use buffers at the same time.
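The usual shape for this is a Treiber stack: each operation reads the head, prepares its change, and retries if a compareAndSet loses a race. This is a minimal sketch of the technique, not Okio’s actual code — the real pool also tracks the pooled byte count, and a production stack must consider the ABA problem:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a lock-free pool as a Treiber stack of segment nodes.
final class LockFreePool {
  static final class Node {
    final byte[] data = new byte[8 * 1024];
    Node next;
  }

  private final AtomicReference<Node> head = new AtomicReference<>();

  Node take() {
    while (true) {
      Node first = head.get();
      if (first == null) return new Node(); // pool empty: allocate fresh
      if (head.compareAndSet(first, first.next)) return first;
      // CAS failed: another thread raced us; retry.
    }
  }

  void recycle(Node node) {
    while (true) {
      Node first = head.get();
      node.next = first;
      if (head.compareAndSet(first, node)) return;
    }
  }
}
```

Contended threads spin briefly and retry instead of parking, which avoids the context switches that make synchronized expensive here.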
Striped
Striping is the concurrency world’s sharding. Take a highly-contended resource, create several instances of it, and distribute its users evenly over the instances.
We scaled up Okio’s segment pool by the number of processors on the host computer. My laptop has 8 physical cores and each has 2 hyperthreads, so it gets 16 segment pools. The pools have 1 MiB of combined capacity.
One potential benefit of striping is an increased probability that a CPU core gets its previous segments back when it fetches from the pool. As implemented, all segments are interchangeable, but recently-accessed memory is more likely to be in a CPU-adjacent cache.
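Striping can be sketched by keeping an array of slots and picking one from the calling thread’s id. This is a simplification with names of my own: each of Okio’s stripes is a full lock-free stack rather than the single slot shown here.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a striped pool: a power-of-two number of slots, with the
// stripe chosen by masking the current thread's id. Different threads
// tend to land on different stripes, spreading out contention.
final class StripedPool {
  private final AtomicReference<byte[]>[] stripes;

  @SuppressWarnings("unchecked")
  StripedPool() {
    int processors = Runtime.getRuntime().availableProcessors();
    int count = Integer.highestOneBit(processors); // round down to a power of two
    stripes = new AtomicReference[count];
    for (int i = 0; i < count; i++) stripes[i] = new AtomicReference<>();
  }

  private AtomicReference<byte[]> stripe() {
    return stripes[(int) (Thread.currentThread().getId() & (stripes.length - 1))];
  }

  byte[] take() {
    byte[] segment = stripe().getAndSet(null);
    return segment != null ? segment : new byte[8 * 1024];
  }

  void recycle(byte[] segment) {
    stripe().set(segment); // may overwrite (and drop) an older segment
  }
}
```

A thread that recycles a segment and takes again soon after is likely to hit the same stripe, which is where the cache-affinity benefit comes from.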
Comparison
This benchmark is designed to create extremely high contention:
pool                size          speedup
none                0 KiB         1.0x
synchronized        64 KiB        1.3x
lockfree            64 KiB        2.0x
striped lockfree    64 KiB/core   5.8x
Under very high contention the synchronized pool is barely useful! And the striped pool is surprisingly great. I confirmed that the extra size wasn’t the difference here: the lockfree pool isn’t faster with a larger pool.
Upgrade Your Okio
These performance improvements are yours in Okio 2.7.0. Make sure to upgrade.