Don’t Block Suspend Functions

Here’s a program that launches 3 jobs. The first runs forever and the other two exchange a value.

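import kotlin.test.Test // Assumption: kotlin.test; org.junit.Test also works.
import kotlinx.coroutines.async
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.delay
import kotlinx.coroutines.isActive
import kotlinx.coroutines.test.runTest

@Test
fun test() = runTest {
  val channel = Channel<String>()

  // Job A: loops until cancelled, suspending on each pass.
  val deferredA = async {
    while (isActive) {
      delay(1_000)
    }
  }
  // Job B: suspends until a receiver is ready to take the value.
  val deferredB = async {
    channel.send("hello")
  }
  // Job C: suspends until a value is available.
  val deferredC = async {
    channel.receive()
  }

  deferredB.await()
  deferredC.await()
  deferredA.cancel()
}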

Once the useful jobs B and C finish, it cancels job A. The program completes in about 50 milliseconds on my laptop.

Programs like this are why I adore Kotlin’s coroutines API. I can orchestrate a lot of behaviour with a small amount of code.

Coroutines are powerful, but they aren’t magic! Jobs achieve concurrency by suspending when they aren’t executing. The whole mechanism can break down if jobs don’t suspend.

Here’s that program with delay() replaced with Thread.sleep(). Both functions pause for a duration, but Thread.sleep() blocks the thread instead of suspending the coroutine:

@Test
fun test() = runTest {
  val channel = Channel<String>()

  val deferredA = async {
    while (isActive) {
      Thread.sleep(1_000) // Blocks the thread; the coroutine never suspends.
    }
  }
  val deferredB = async {
    channel.send("hello")
  }
  val deferredC = async {
    channel.receive()
  }

  deferredB.await()
  deferredC.await()
  deferredA.cancel()
}

This program never finishes! runTest runs these coroutines on a single thread, and job A never gives that thread up. Jobs B and C don’t get a chance to execute, so the await() calls never return.

Preemptive vs. Cooperative Concurrency

Threads implement a preemptive concurrency model. The operating system decides when each thread runs: when a thread blocks or uses up its time slice, the operating system takes the CPU back (it preempts the thread) and gives it to another thread. Such context switches are expensive, so it’s conventional to do coarse-grained concurrency with threads.

Coroutines implement a cooperative concurrency model. When a coroutine suspends, the dispatcher immediately runs the next coroutine. It’s no problem to have thousands of coroutines so it’s conventional to do fine-grained concurrency in this model.
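Here’s a minimal sketch of that hand-off, assuming a single-threaded dispatcher built from a plain executor. The interleave() helper and the log labels are made up for illustration. Each coroutine calls yield() to suspend, and the dispatcher runs whichever coroutine is queued next:

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import kotlinx.coroutines.yield
import java.util.concurrent.Executors

// Hypothetical helper: two coroutines share one thread and take turns by suspending.
suspend fun interleave(): List<String> {
  val dispatcher = Executors.newSingleThreadExecutor().asCoroutineDispatcher()
  val log = mutableListOf<String>()
  try {
    // Both coroutines are launched from the single thread itself, so neither
    // can start running until this block has finished launching them.
    withContext(dispatcher) {
      launch {
        log += "A1"
        yield() // Suspend: hand the thread to the next queued coroutine.
        log += "A2"
      }
      launch {
        log += "B1"
        yield() // Suspend again so that A can finish.
        log += "B2"
      }
    }
  } finally {
    dispatcher.close()
  }
  return log
}

suspend fun main() {
  println(interleave()) // [A1, B1, A2, B2]
}
```

Neither coroutine waits for the other to finish. Each one runs until it suspends, and that is the only point where the thread changes hands.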

Mixing and matching models is bad. When a coroutine blocks, it’s being selfish in a model that requires cooperation. A single blocked coroutine can prevent thousands of other coroutines from executing.
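To make the selfishness concrete, here’s a rough sketch that measures how long a trivial coroutine waits behind a neighbour that pauses for 500 milliseconds. The quickJobStartDelayMillis() helper is hypothetical, and it assumes a single-threaded dispatcher so that the effect is easy to see:

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import java.util.concurrent.Executors

// Hypothetical helper: returns how long the second coroutine waited before it
// got to run on a dispatcher that has only one thread.
suspend fun quickJobStartDelayMillis(blockInsteadOfSuspend: Boolean): Long {
  val dispatcher = Executors.newSingleThreadExecutor().asCoroutineDispatcher()
  try {
    val start = System.nanoTime()
    var quickStartedAt = 0L
    coroutineScope {
      launch(dispatcher) {
        // The slow neighbour: either blocks the thread or suspends politely.
        if (blockInsteadOfSuspend) Thread.sleep(500) else delay(500)
      }
      launch(dispatcher) {
        // The quick job: records the moment it finally gets the thread.
        quickStartedAt = System.nanoTime()
      }
    }
    return (quickStartedAt - start) / 1_000_000
  } finally {
    dispatcher.close()
  }
}

suspend fun main() {
  println("suspending neighbour: waited ${quickJobStartDelayMillis(false)} ms")
  println("blocking neighbour: waited ${quickJobStartDelayMillis(true)} ms")
}
```

With delay(), the quick job should start almost immediately. With Thread.sleep(), it can’t start until the full 500 milliseconds have passed. Exact numbers will vary by machine.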

When I’ve made this mistake, it blows up like a time bomb:

1. My server is running smoothly, serving thousands of requests per second.
2. Some blocking I/O thing that my server uses temporarily slows down. Perhaps my server calls an auth service to check credentials, and that service is redeployed.
3. All of my coroutine dispatchers stall, waiting on blocking calls that accumulate faster than they complete.
4. My server is wrecked and times out on thousands of requests per second.

I can mix and match models without problems for months, and then a small event can trigger a collapse. There’s a DoorDash blog post with details of how this hazard impacted them.

What I’m Doing

Here are the rules I follow when I use coroutines:

1. Never call a blocking function from a suspending function.
2. Use Dispatchers.IO to escape the limitations of rule 1. (Making blocking calls on the I/O dispatcher is fine.)
3. Never call runBlocking(). This rule is much easier to follow now that we can put suspend on main.

It’s possible to write correct programs that are less strict than this! But I dislike the fragility of code that makes assumptions about the caller’s CoroutineDispatcher.
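The rules above can be sketched like this. The blockingLookup() function is a hypothetical stand-in for any blocking call, such as a JDBC query or a synchronous HTTP request, and the names are mine rather than from any real API:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Hypothetical blocking call. Thread.sleep() stands in for blocking I/O.
fun blockingLookup(userId: String): String {
  Thread.sleep(100)
  return "credentials-for-$userId"
}

// Rules 1 and 2: the suspending wrapper never blocks its caller's dispatcher.
// The blocking work runs on Dispatchers.IO, and the caller suspends meanwhile.
suspend fun lookup(userId: String): String = withContext(Dispatchers.IO) {
  blockingLookup(userId)
}

// Rule 3: a suspending main needs no runBlocking().
suspend fun main() {
  println(lookup("alice")) // credentials-for-alice
}
```

Because lookup() switches dispatchers itself, it’s safe to call from any coroutine without knowing which dispatcher the caller is on.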

Coroutines are rad and you should use ’em.