We’re in the age of Emoji. We fought in the charset wars, obsoleted old bad encodings like ISO-8859-1, and emerged victorious. Being able to sprinkle our writing with sprinkled donut emoji is the English speaker’s upside to ubiquitous UTF-8.
Unfortunately, java.io.Reader is stuck in the UTF-16 ghetto: all of the hard work of internationalization but without the donut emoji.
An example
Let’s read characters from a Reader until we hit ‘🍺’ (0x1f37a), and then we'll stop. The naïve solution doesn't work:
byte[] data = new byte[] {
    (byte) 0x68, (byte) 0x65, (byte) 0x6c, (byte) 0x6c,
    (byte) 0x6f, (byte) 0xf0, (byte) 0x9f, (byte) 0x8d,
    (byte) 0xa9, (byte) 0x77, (byte) 0x6f, (byte) 0x72,
    (byte) 0x6c, (byte) 0x64, (byte) 0xf0, (byte) 0x9f,
    (byte) 0x8d, (byte) 0xba
};
Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
for (int c; (c = reader.read()) != 0x1f37a; ) {
  System.out.printf("%08x: %s%n", c, new String(new int[] { c }, 0, 1));
}
This crashes because read() returns UTF-16 code units, not code points, so each emoji comes back in two halves. We never see 0x1f37a, miss the beer altogether, and run off the end of the input.
00000068: h
00000065: e
0000006c: l
0000006c: l
0000006f: o
0000d83c: ?
0000df69: ?
00000077: w
0000006f: o
00000072: r
0000006c: l
00000064: d
0000d83c: ?
0000df7a: ?
Exception in thread "main" java.lang.IllegalArgumentException: -1
at java.lang.String.<init>(String.java:256)
at Example.main(Example.java:12)
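That failure is ordinary UTF-16 surrogate encoding at work. A quick standard-library sketch shows where the two trailing halves come from:
// U+1F37A doesn't fit in a single 16-bit char, so UTF-16 encodes it as a
// surrogate pair, and Reader.read() hands back the halves one at a time.
char[] pair = Character.toChars(0x1f37a);
System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]); // prints "d83c df7a"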
The workaround is unfortunate. We need to glue the two UTF-16 halves together manually. If for whatever reason we get a top half without a corresponding bottom half, that's our problem.
static int readCodePoint(Reader reader) throws IOException {
  int c1 = reader.read();
  if (c1 == -1 || !Character.isSurrogate((char) c1)) {
    return c1; // c1 is an easy non-surrogate character. We're done.
  }
  if (Character.isLowSurrogate((char) c1)) {
    // c1 is a low surrogate but we need a high one.
    return '\ufffd'; // (That's the replacement character.)
  }
  // We have a high surrogate. Hopefully next is a low surrogate.
  reader.mark(1);
  int c2 = reader.read();
  if (c2 == -1 || !Character.isLowSurrogate((char) c2)) {
    // Didn't get what we want. Push it back.
    reader.reset();
    return '\ufffd'; // (That's the replacement character again.)
  }
  // c1 and c2 form a surrogate pair. Join 'em.
  return Character.toCodePoint((char) c1, (char) c2);
}
This does what we want, though the fact that we need to use mark and reset means we have to wrap our Reader in a BufferedReader.
...
Reader reader = new BufferedReader(new InputStreamReader(
    new ByteArrayInputStream(data), StandardCharsets.UTF_8));
for (int c; (c = readCodePoint(reader)) != 0x1f37a; ) {
  System.out.printf("%08x: %s%n", c, new String(new int[] { c }, 0, 1));
}
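Run against the same bytes, the loop now sees the donut as a single code point and stops cleanly when it reads the beer; the output should look like this:
00000068: h
00000065: e
0000006c: l
0000006c: l
0000006f: o
0001f369: 🍩
00000077: w
0000006f: o
00000072: r
0000006c: l
00000064: d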
Okio makes this easy
Because it’s a child of the emoji age, Okio’s Buffer does UTF-8 natively. And in Okio 1.4.0, we have a new API to read a code point directly from a source.
ByteString data = ByteString.decodeHex("68656c6c6ff09f8da9776f726c64f09f8dba");
BufferedSource source = new Buffer().write(data);
for (int c; (c = source.readUtf8CodePoint()) != 0x1f37a; ) {
  System.out.printf("%08x: %s%n", c, new String(new int[] { c }, 0, 1));
}
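If you’d rather not hand-write hex, the same buffer can be built from the string itself; Buffer.writeUtf8 encodes it for you:
BufferedSource source = new Buffer().writeUtf8("hello🍩world🍺");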
Get Okio
Get Okio from Maven Central:
<dependency>
  <groupId>com.squareup.okio</groupId>
  <artifactId>okio</artifactId>
  <version>1.4.0</version>
</dependency>
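Or, if you build with Gradle, the same coordinates work there too:
compile 'com.squareup.okio:okio:1.4.0'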