We’re in the age of Emoji. We fought in the charset wars, obsoleted old bad encodings like ISO-8859-1, and emerged victorious. Being able to sprinkle our writing with sprinkled donut emoji is the English speaker’s upside to ubiquitous UTF-8.
Unfortunately, java.io.Reader is stuck in the UTF-16 ghetto: all of the hard work of internationalization but without the donut emoji.
An example
Let’s read characters from a Reader until we hit ‘🍺’ (0x1f37a), and then we'll stop. The naïve solution doesn't work:
byte[] data = new byte[] {
    (byte) 0x68, (byte) 0x65, (byte) 0x6c, (byte) 0x6c,
    (byte) 0x6f, (byte) 0xf0, (byte) 0x9f, (byte) 0x8d,
    (byte) 0xa9, (byte) 0x77, (byte) 0x6f, (byte) 0x72,
    (byte) 0x6c, (byte) 0x64, (byte) 0xf0, (byte) 0x9f,
    (byte) 0x8d, (byte) 0xba
};
Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
for (int c; (c = reader.read()) != 0x1f37a; ) {
  System.out.printf("%08x: %s%n", c, new String(new int[] { c }, 0, 1));
}
This crashes because read() returns UTF-16 code units, not code points, so each emoji comes back in two halves. We never see 0x1f37a, miss the beer altogether, and run off the end of the input.
00000068: h
00000065: e
0000006c: l
0000006c: l
0000006f: o
0000d83c: ?
0000df69: ?
00000077: w
0000006f: o
00000072: r
0000006c: l
00000064: d
0000d83c: ?
0000df7a: ?
Exception in thread "main" java.lang.IllegalArgumentException: -1
at java.lang.String.<init>(String.java:256)
at Example.main(Example.java:12)
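That failure is ordinary UTF-16 surrogate encoding at work. A quick standard-library sketch shows where the two trailing halves come from:
// U+1F37A doesn't fit in a single 16-bit char, so UTF-16 encodes it as a
// surrogate pair, and Reader.read() hands back the halves one at a time.
char[] pair = Character.toChars(0x1f37a);
System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]); // prints "d83c df7a"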
The workaround is unfortunate. We need to glue the two UTF-16 halves together manually. If for whatever reason we get a top half without a corresponding bottom half, that's our problem.
static int readCodePoint(Reader reader) throws IOException {
  int c1 = reader.read();
  if (c1 == -1 || !Character.isSurrogate((char) c1)) {
    return c1; // c1 is an easy non-surrogate character. We're done.
  }
  if (Character.isLowSurrogate((char) c1)) {
    // c1 is a low surrogate but we need a high one.
    return '\ufffd'; // (That's the replacement character.)
  }
  // We have a high surrogate. Hopefully next is a low surrogate.
  reader.mark(1);
  int c2 = reader.read();
  if (c2 == -1 || !Character.isLowSurrogate((char) c2)) {
    // Didn't get what we want. Push it back.
    reader.reset();
    return '\ufffd'; // (That's the replacement character again.)
  }
  // c1 and c2 form a surrogate pair. Join 'em.
  return Character.toCodePoint((char) c1, (char) c2);
}
This does what we want, though the fact that we need to use mark and reset means we have to wrap our Reader in a BufferedReader.
...
Reader reader = new BufferedReader(new InputStreamReader(
    new ByteArrayInputStream(data), StandardCharsets.UTF_8));
for (int c; (c = readCodePoint(reader)) != 0x1f37a; ) {
  System.out.printf("%08x: %s%n", c, new String(new int[] { c }, 0, 1));
}
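Run against the same bytes, the loop now sees the donut as a single code point and stops cleanly when it reads the beer; the output should look like this:
00000068: h
00000065: e
0000006c: l
0000006c: l
0000006f: o
0001f369: 🍩
00000077: w
0000006f: o
00000072: r
0000006c: l
00000064: d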
Okio makes this easy
Because it’s a child of the emoji age, Okio’s Buffer does UTF-8 natively. And in Okio 1.4.0, we have a new API to read a code point directly from a source.
ByteString data = ByteString.decodeHex("68656c6c6ff09f8da9776f726c64f09f8dba");
BufferedSource source = new Buffer().write(data);
for (int c; (c = source.readUtf8CodePoint()) != 0x1f37a; ) {
  System.out.printf("%08x: %s%n", c, new String(new int[] { c }, 0, 1));
}
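If you’d rather not hand-write hex, the same buffer can be built from the string itself; Buffer.writeUtf8 encodes it for you:
BufferedSource source = new Buffer().writeUtf8("hello🍩world🍺");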
Get Okio
Get Okio from Maven Central:
<dependency>
  <groupId>com.squareup.okio</groupId>
  <artifactId>okio</artifactId>
  <version>1.4.0</version>
</dependency>
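Or, if you build with Gradle, the same coordinates work there too:
compile 'com.squareup.okio:okio:1.4.0'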