
Why there's no String.getCharset()

Java's String class has a two-argument constructor that takes a byte array and a charset name:
byte[] characters = { 83, 87, 65, 78, 75, 46, 67, 65 };
String myString = new String(characters, "ISO-8859-1");

Symmetrically, you might expect this:
byte[] theBytes = myString.getBytes();
String theCharset = myString.getCharsetName();

The second line doesn't compile because there's no getCharsetName() method on String.

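Why? A String is always a sequence of UTF-16 code units, regardless of which charset was used to create it. The String(byte[], charsetName) constructor decodes the bytes into UTF-16 characters using the charset as a guide, and then the charset is forgotten. There is nothing left for a getCharsetName() method to return.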
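To make this concrete, here is a minimal sketch that decodes the same character from two different byte encodings and ends up with equal Strings. The class and method names (SameStringDemo, fromLatin1, fromUtf8) are made up for illustration; the only library calls are the String(byte[], Charset) constructor and the StandardCharsets constants, which assume Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class SameStringDemo {
    // "é" encoded as ISO-8859-1 is the single byte 0xE9.
    static String fromLatin1() {
        byte[] latin1 = { (byte) 0xE9 };
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    // "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
    static String fromUtf8() {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String a = fromLatin1();
        String b = fromUtf8();
        System.out.println(a.equals(b));        // true
        System.out.println(a.length());         // 1
        System.out.println((int) a.charAt(0));  // 233, i.e. U+00E9
    }
}
```

Neither String remembers which charset its bytes came from; both hold the same single UTF-16 code unit.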

This turns out to be very handy:
  • Since all Strings use the same charset, there's no need to convert charsets when doing compareTo(), indexOf() or equals().
  • Once you have a String, you don't need to think about its character set! Charsets and encodings only matter when you're converting between byte[]s and Strings.
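The second point can be sketched from the other direction: the charset only comes back into play when a String is turned into bytes again. This is an illustrative example, not code from any particular project; EncodeDemo and its helper methods are hypothetical names, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class EncodeDemo {
    // Encodes the String as ISO-8859-1: one byte per character for this text.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Encodes the String as UTF-8: "é" takes two bytes.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "café"

        // String operations never ask about a charset.
        System.out.println(s.indexOf('\u00e9'));     // 3
        System.out.println(s.equals("caf\u00e9"));   // true

        // The charset matters only at the byte[] boundary.
        System.out.println(encodeLatin1(s).length);  // 4
        System.out.println(encodeUtf8(s).length);    // 5
    }
}
```

Passing the charset explicitly to getBytes() is a deliberate choice in this sketch: the no-argument getBytes() falls back to the platform's default charset, which is exactly the kind of hidden dependency that causes trouble.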

Unfortunately, some code in an otherwise awesome project that misunderstands this concept has caused me some grief! Hopefully everything will be resolved soon.

One cause of this problem is that Java developers have been trained to expect that constructor arguments will be used to initialize an object's properties. Perhaps instead of a constructor, Java's designers could have used a simple factory method to make the decoding action more explicit:
public static String decodeBytes(byte[] bytes, String charsetName)
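As a rough sketch of what such a factory method might look like in your own code: the DecodingStrings class below is hypothetical, and decodeBytes is not part of the JDK. It simply delegates to the existing String(byte[], Charset) constructor, looking the charset up with Charset.forName().

```java
import java.nio.charset.Charset;

// Hypothetical utility class; decodeBytes is not a real JDK method.
class DecodingStrings {
    // Decodes the bytes using the named charset. The charset is consulted
    // during decoding only and is not stored in the returned String.
    static String decodeBytes(byte[] bytes, String charsetName) {
        // Charset.forName throws an unchecked exception for unknown names.
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] characters = { 83, 87, 65, 78, 75, 46, 67, 65 };
        String myString = decodeBytes(characters, "ISO-8859-1");
        System.out.println(myString); // SWANK.CA
    }
}
```

The name makes it harder to read the charset as a property of the result: the call site says "decode these bytes", not "make a String with this charset".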


For a great overview of why character sets are the way they are, check out Joel Spolsky's article.