Each time you encode characters as bytes, you should do two things (see the sketch after this list):
- Use UTF-8.
- Explicitly specify that UTF-8 is your charset.
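For example, in Java that might look like the sketch below. The class name, the output file, and the Content-Type value are just placeholders for illustration, not anything a particular library requires:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodeUtf8 {
  public static void main(String[] args) throws Exception {
    String text = "café ₤ 漢字";

    // Use UTF-8: name the charset at the call site so the platform
    // default charset can never sneak in.
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

    // The same rule applies when writing a file.
    Files.writeString(Path.of("greeting.txt"), text, StandardCharsets.UTF_8);

    // Explicitly specify the charset in whatever envelope carries the
    // bytes, for example an HTTP Content-Type header.
    String contentType = "text/plain; charset=utf-8";
    System.out.println(contentType + ": " + bytes.length + " bytes");
  }
}
```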
Conversely, whenever you decode bytes to characters, you should also do two things (again sketched below):
- Confirm that the charset is explicitly set to UTF-8.
- Use UTF-8.
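A matching decode-side sketch; the header value and file name are carried over from the hypothetical encoding example above:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

public class DecodeUtf8 {
  public static void main(String[] args) throws Exception {
    // Confirm the declared charset really is UTF-8 before trusting the bytes.
    String contentType = "text/plain; charset=utf-8";  // e.g. from an HTTP header
    if (!contentType.toLowerCase(Locale.ROOT).contains("charset=utf-8")) {
      throw new IllegalArgumentException("expected UTF-8 but got: " + contentType);
    }

    // Use UTF-8: decode with an explicit charset, never the platform default.
    byte[] bytes = Files.readAllBytes(Path.of("greeting.txt"));
    String text = new String(bytes, StandardCharsets.UTF_8);
    System.out.println(text);
  }
}
```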
If you're working with data encoded in something other than UTF-8, or data with no charset specified, you should demand UTF-8 from whoever produced it.
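One way to demand UTF-8 in code is to decode strictly, so malformed bytes fail loudly instead of silently becoming U+FFFD replacement characters. A sketch using Java's CharsetDecoder (the class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DemandUtf8 {
  /** Decodes strictly: malformed input throws instead of turning into U+FFFD. */
  static String decodeUtf8Strictly(byte[] bytes) throws CharacterCodingException {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    return decoder.decode(ByteBuffer.wrap(bytes)).toString();
  }

  public static void main(String[] args) {
    byte[] legacy = {(byte) 0xE9};  // "é" in ISO-8859-1; malformed as UTF-8
    try {
      decodeUtf8Strictly(legacy);
    } catch (CharacterCodingException e) {
      // Reject the data and ask whoever produced it for UTF-8 instead.
      System.err.println("Not UTF-8, rejecting: " + e);
    }
  }
}
```

By contrast, new String(bytes, UTF_8) quietly replaces malformed sequences with U+FFFD, which hides exactly the problem you want to surface.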
Switching to UTF-8 everywhere is progress. If we all do this, then future generations won't need to learn the horrors, hassles and bugs of managing multiple character encodings.
A brief history of standardization
In 1964, the first IBM computers with 8-bit bytes were introduced. Today 8-bit bytes are universal and nobody maintains code to support 6-bit bytes.
In 1982, the US military standardized on TCP/IP. Today IP networking is universal and nobody maintains code to support IPX/SPX or AppleTalk.
In 1993, UTF-8 was released. We're very near the day when we can drop support for ISO-8859-1 and many other obsolete character sets.
Why UTF-8
Because all other encodings are inferior:
- ASCII and ISO-8859-1 lack essential characters.
- UTF-16 is overweight, has endianness problems, and needs surrogate pairs for anything outside the Basic Multilingual Plane.
- UTF-32 is obese.
See UTF-8 Everywhere for a more complete comparison, and tactics for upgrading to UTF-8 in your applications.
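For a quick feel for the weight difference, here's a small sketch (still Java, sample string chosen arbitrarily) that prints how many bytes the same text needs in each encoding. Note that UTF-32 isn't one of the guaranteed StandardCharsets, though it ships with common JDKs:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingWeights {
  public static void main(String[] args) {
    String text = "naïve 漢字 🚀";

    // The -BE variants pin the byte order, so no byte-order mark
    // inflates the counts.
    System.out.println("UTF-8:  " + text.getBytes(StandardCharsets.UTF_8).length + " bytes");
    System.out.println("UTF-16: " + text.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
    System.out.println("UTF-32: " + text.getBytes(Charset.forName("UTF-32BE")).length + " bytes");
  }
}
```

For this sample the counts come out to 18, 22 and 40 bytes respectively.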