PUBLIC OBJECT

Demand UTF-8

Each time you encode characters as bytes, you should do two things:

  • Use UTF-8.
  • Explicitly specify that UTF-8 is your charset.
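
Here's a minimal Python sketch of both rules. The helper names `encode_text` and `content_type_header` are invented for illustration, and the HTTP-style `Content-Type` value is just one assumed way to label a charset; your protocol or file format may have its own.

```python
def encode_text(text: str) -> bytes:
    """Encode characters as bytes, always as UTF-8."""
    # Pass the charset by name instead of relying on any default.
    return text.encode("utf-8")


def content_type_header(media_type: str = "text/plain") -> str:
    """Build a Content-Type value that states the charset explicitly."""
    return f"{media_type}; charset=utf-8"


body = encode_text("café")
header = content_type_header()

print(body)    # b'caf\xc3\xa9'
print(header)  # text/plain; charset=utf-8
```

The same habit applies when writing files: `open(path, "w", encoding="utf-8")` names the charset rather than leaving it to the platform.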

Conversely, whenever you decode bytes to characters, you should also do two things:

  • Confirm that the charset is explicitly set to UTF-8.
  • Use UTF-8.
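
The decoding side might look like the sketch below. `CharsetError` and `decode_text` are hypothetical names, and the assumption is that the declared charset reaches you as a plain string (or `None` when it's missing); how you obtain it depends on your transport.

```python
from typing import Optional


class CharsetError(Exception):
    """Raised when bytes arrive without an explicit UTF-8 label."""


def decode_text(data: bytes, declared_charset: Optional[str]) -> str:
    """Decode bytes to characters only if they are labeled as UTF-8."""
    if declared_charset is None:
        raise CharsetError("no charset specified; demand UTF-8")
    if declared_charset.strip().lower() != "utf-8":
        raise CharsetError(f"charset is {declared_charset!r}; demand UTF-8")
    # Strict decoding (the default) raises UnicodeDecodeError on malformed
    # input instead of silently substituting replacement characters.
    return data.decode("utf-8")


print(decode_text(b"caf\xc3\xa9", "UTF-8"))  # café
```

Refusing unlabeled or mislabeled input up front is a deliberate choice here: guessing the encoding would hide exactly the bugs this rule is meant to surface.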

If you're handed data encoded in something other than UTF-8, or data with no charset specified, you should demand UTF-8 from whoever produced it.

Switching to UTF-8 everywhere is progress. If we all do this, then future generations won't need to learn the horrors, hassles and bugs of managing multiple character encodings.

A brief history of standardization

In 1964, the first IBM computers with 8-bit bytes were introduced. Today 8-bit bytes are universal and nobody maintains code to support 6-bit bytes.

In 1982, the US military standardized on TCP/IP. Today IP networking is universal and nobody maintains code to support IPX/SPX or AppleTalk.

In 1993, UTF-8 was released. We're very close to the day when we can drop support for ISO-8859-1 and many other obsolete character sets.

Why UTF-8

Because all other encodings are inferior:

  • ASCII and ISO-8859-1 lack essential characters.
  • UTF-16 is overweight, has endianness problems and needs surrogate pairs.
  • UTF-32 is obese.
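
You can check these claims yourself with a few lines of Python. This is only a rough sketch: the sample strings and the `encoded_size` helper are made up for illustration, and the codec names are the ones Python's standard library uses.

```python
CHARSETS = ["ascii", "iso-8859-1", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_size(text: str, charset: str):
    """Return the encoded length in bytes, or None if the charset
    cannot represent the text at all."""
    try:
        return len(text.encode(charset))
    except UnicodeEncodeError:
        return None


# Print the byte count of each sample in each charset.
for sample in ["hello", "café", "€", "😀"]:
    sizes = {charset: encoded_size(sample, charset) for charset in CHARSETS}
    print(repr(sample), sizes)

# Endianness: the same character yields different bytes in each byte order.
print("A".encode("utf-16-le"))  # b'A\x00'
print("A".encode("utf-16-be"))  # b'\x00A'
```

ASCII can't encode "café" and ISO-8859-1 can't encode "€", so both come back as `None`. Plain "hello" takes 5 bytes in UTF-8, 10 in UTF-16 and 20 in UTF-32. The emoji takes 4 bytes in UTF-16 because it needs a surrogate pair.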

See UTF-8 Everywhere for a more complete comparison and for tactics to upgrade your applications to UTF-8.