Character Encoding

Character encoding is an important topic in software development, especially these days when anyone can publish something on the Web. When you publish software, or even Web content, you want to make the best possible impression. Having strange characters appear on your users' screens definitely does not make that impression.

As an English speaker working in the United States, I've been able to avoid understanding this issue for a long time. Seeing the current state of the Web, so have a lot of others. But as I started ripping and tagging my classical music CDs, and dealing with names such as Béla Bartók, Iannis Xenakis' Pléïades, and Messiaen's Turangalîla-symphonie, not to mention countless composers' collections of Préludes, it became obvious that I needed to understand this stuff.

In addition, I'm trying to learn German. Consider the following two sentences from the German translation of Harry Potter and the Sorcerer's Stone:

Die Dursleys besaßen alles, was sie wollten, doch sie hatten auch ein Geheimnis, und dass es jemand aufdecken könnte, war ihre größte Sorge. Einfach unerträglich wäre es, wenn die Sache mit den Potters herauskommen würde.

Certainly you've received an e-mail, or read a news story on the Web, and seen things like quote marks, registered trademark symbols, long dashes, and other "special" characters that show up like the above. Not only is this unprofessional, it is nearly unreadable.

A basic understanding of character encoding goes a long way to ending these problems once and for all. This looks a lot better:

Die Dursleys besaßen alles, was sie wollten, doch sie hatten auch ein Geheimnis, und dass es jemand aufdecken könnte, war ihre größte Sorge. Einfach unerträglich wäre es, wenn die Sache mit den Potters herauskommen würde.

Purpose of this article

There are a lot of articles out there about character encoding. Specifically, Joel Spolsky's article is the best introduction I've read on the topic, and I reread it every few months just to make sure I still understand it. It's not my goal to reproduce that article, so if you're starting from scratch, I encourage you to read it. The single most important statement he makes:

It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

What I'm hoping to achieve here is to present a few concrete examples of character encoding in action, including some examples of encoding gone wrong, so that you can start using ä's instead of ä's, ö's instead of ö's, etc. For me, a few real-world examples are necessary for a firm understanding of the concepts.

ASCII

First of all, consider ASCII. Any single byte between 0 and 127 represents a single character. There is a 1-to-1 mapping between bytes and characters. The word byte, for instance, is 4 characters long, and is represented by 4 bytes in ASCII:

62 79 74 65
 b  y  t  e
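If you want to check this mapping yourself, a few lines of Python will do it. This is only a sketch; Python is my choice for illustration, and the variable names are made up for the example.

```python
# Encode the word "byte" as ASCII and look at the resulting bytes.
word = "byte"
ascii_bytes = word.encode("ascii")

# bytes.hex() with a separator argument needs Python 3.8 or later.
ascii_hex = ascii_bytes.hex(" ")

print(ascii_hex)                     # one two-digit hex value per character
print(len(word), len(ascii_bytes))   # character count and byte count match
```

The character count and the byte count come out equal, which is the 1-to-1 mapping in action.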

Now, suppose I want to travel to Germany. I might find myself needing the word fährt. Since there's no ä in ASCII, I can either:

  1. Misspell the word (fahrt).
  2. Choose an alternate spelling (i.e. faehrt).
  3. Use something other than ASCII to represent it.

Unfortunately, the first option is chosen far too often. That's why we have online stores selling Beethoven's Fur Elise. The second option is really a hack; it reminds me of seeing someone's name written as Renee' rather than Renée, simply because no one felt like going to the effort to produce a proper é. Let's choose the third option.

ISO-8859-1

ASCII may not have an ä, but ISO-8859-1 does. The ISO-8859 encodings use the same 1-to-1 mapping concept seen in ASCII: each character is represented by a byte. There are just more letters defined. So, if I want to represent the word fährt in ISO-8859-1, I do it like this:

66 e4 68 72 74
 f  ä  h  r  t
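The same check works here. The sketch below, again in Python with illustrative variable names, encodes fährt as ISO-8859-1 and also confirms that ASCII refuses the word. I write the ä as the escape \u00e4 so the example does not depend on how the source file itself is encoded.

```python
# "fährt", with the ä spelled as an escape sequence.
word = "f\u00e4hrt"

# ISO-8859-1 has a code for ä, so every character becomes one byte.
latin1_hex = word.encode("iso-8859-1").hex(" ")
print(latin1_hex)

# ASCII has no ä, so the same call fails.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```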

ä is simply another character code. Problem solved, right?

As long as everyone agrees to use ISO-8859-1, we're fine. But ISO-8859-1 doesn't give me a way to represent the Euro symbol, so if I want to buy something in €'s, I need to choose a different encoding like ISO-8859-15. And it will probably work for me. But what if I need to represent an Š? It's starting to get confusing.

The problem is, there are a limited number of values a byte can take on. I can't possibly represent every character of every language using a 1-to-1 mapping, so I have to make choices. Because of the 1-to-1 mapping, the encoding is closely tied to the characters it represents.
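To see how tightly a single-byte encoding is tied to its character set, here is a small Python sketch. The helper try_encode is a name I made up for this example; it asks each encoding whether it has a byte for a given character.

```python
def try_encode(char, encoding):
    """Return the encoded byte(s) as hex, or None if the encoding lacks the character."""
    try:
        return char.encode(encoding).hex()
    except UnicodeEncodeError:
        return None

euro = "\u20ac"     # the Euro symbol
s_caron = "\u0160"  # capital S with caron

for char in (euro, s_caron):
    for encoding in ("iso-8859-1", "iso-8859-15"):
        print(repr(char), encoding, try_encode(char, encoding))

# The same byte value means different things in the two encodings.
same_byte = bytes([0xA4])
print(same_byte.decode("iso-8859-1"), same_byte.decode("iso-8859-15"))
```

ISO-8859-15 gains the Euro symbol by reusing a byte value that ISO-8859-1 assigns to another character, so a byte only has meaning once you know which encoding it belongs to.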

Unicode

Unicode doesn't really care about this mapping, or any mapping. Unicode is just a concept to represent characters as codes, and agree on these in a consistent way. We can all agree on which code represents the character ä, without regard to how it is represented in terms of bytes. Once that part has been done (and it has), we can consider how to represent the string of codes as actual bytes.

Let's look at the word fährt again. In Unicode, we can represent the letters with the Unicode characters:

U+0066  U+00e4  U+0068  U+0072  U+0074
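In Python, a string is already a sequence of Unicode codes, so the list above can be reproduced without choosing any byte encoding at all. A minimal sketch (the variable names are mine):

```python
# "fährt", with the ä spelled as an escape sequence.
word = "f\u00e4hrt"

# ord() gives the Unicode code for a character; no bytes are involved yet.
code_points = [f"U+{ord(ch):04x}" for ch in word]
print("  ".join(code_points))
```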

But these codes just represent the characters in some theoretical character space. To get actual bytes that we can use in files or transfer over the Web, we have to choose a way to encode these characters.

UTF-8

UTF-8 is emerging as the way to do this. UTF-8 is capable of representing any and all Unicode characters, but is backward compatible with ASCII. It achieves this by doing away with the 1-to-1 mapping concept of ASCII and the ISO-8859 standards, and using something more akin to Huffman codes. Frequently-occurring characters, like those in the ASCII character set, are given representations that take up a single byte. It's no accident that documents using only ASCII characters are perfectly valid UTF-8 documents; this is the most compelling feature of UTF-8.

Less frequently-occurring (but still fairly common) characters, like accented European characters, typically take two bytes, but can still be used without increasing the file size too much. Rarely-used characters, like Chinese glyphs, typically take three bytes to represent, and indeed UTF-8 may not be the right choice for documents consisting mostly of Chinese characters, for instance. But for Western language documents, even when they contain some non-Western content, it is generally a good solution.

Let's look back at the word fährt. Most of the characters are in the ASCII alphabet, and are represented identically to their ASCII representation. The only unusual character is ä, which is U+00e4 in Unicode. To encode it in UTF-8, we note that it falls within the range 000080-0007FF (i.e. it is above decimal 127 and below 2048; refer to the Wikipedia article on UTF-8 for details on the mechanics). This particular range is encoded as two bytes, according to the UTF-8 standard. The first of these bytes starts with the bits 110, and the second with 10, leaving the remaining bits for us to fill in:

110xxxxx 10yyyyyy

U+00e4 can be represented as 000 1110 0100 in binary (since UTF-8 gives us 11 binary digits to work with in this range, we pad on the left with zeroes). Now, we just fill in these 11 bits in the space UTF-8 gives us for encoding:

   00011   100100       <-- Unicode for ä
     |        |
     v        v
110xxxxx 10yyyyyy       <-- UTF-8 standard
     |        |
     v        v
11000011 10100100       <-- resultant UTF-8 encoding of ä
   c   3    a   4 (hex)

We have the UTF-8 representation of the character ä: the byte sequence 0xc3 0xa4. So, the word fährt is encoded as

66 c3 a4 68 72 74
    \ /
 f   ä    h  r  t
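The bit-filling steps above can be sketched in Python. The function encode_two_byte is a hypothetical helper written for this article; it only handles the two-byte range, and I compare its result with Python's built-in encoder as a sanity check.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range 0x80-0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    # Top 5 of the 11 bits go after the 110 prefix.
    first = 0b11000000 | (code_point >> 6)
    # Bottom 6 bits go after the 10 prefix.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])

manual = encode_two_byte(0x00E4)
builtin = "\u00e4".encode("utf-8")
print(manual.hex(" "), builtin.hex(" "))

# The whole word, encoded by Python itself.
word_hex = "f\u00e4hrt".encode("utf-8").hex(" ")
print(word_hex)
```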

Problems

UTF-8 interpreted as ISO-8859-1

In the first Harry Potter example above, there were lots of funny characters. Sometimes, instead of the word fährt, you might see the nonsense fährt. What causes this?

Consider the byte sequence above, which represents the UTF-8 encoding of fährt. If we didn't know it was encoded with UTF-8, we might try to interpret it as ISO-8859-1, since it's such a common encoding. And, since nearly every byte maps to a character in ISO-8859-1, we can do that. 0xc3 is Ã and 0xa4 is ¤, causing the word to appear this way:

66 c3 a4 68 72 74
 f  Ã  ¤  h  r  t
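This mix-up is easy to reproduce. The Python sketch below (variable names are mine) encodes the word as UTF-8 and then deliberately decodes it with the wrong encoding. It also shows that, as long as no bytes were lost along the way, the damage can be undone by reversing the two steps.

```python
# Encode correctly as UTF-8...
utf8_bytes = "f\u00e4hrt".encode("utf-8")

# ...then decode with the wrong encoding, as a confused program would.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Undo the mistake: turn the wrong characters back into the original bytes,
# then decode those bytes the way they were meant to be read.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

Note that the misread string is one character longer than the original, because the two bytes of ä were each turned into a character of their own.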

The single most common reason for seeing words like fährt, könnte, and würde is a failure to understand the encoding. The reason for this failure is harder to pin down. On the Web, the HTML document can contain a Content-Type <meta> element that declares the encoding, or the web server can send a Content-Type HTTP header. If these don't exist, the browser may try to guess an encoding (and guess incorrectly), or worse, a page may claim that it uses one encoding, but actually use another. This is pretty common on sites that syndicate content from other sites.
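As a rough illustration of how a program might honor a declared encoding, here is a Python sketch that pulls the charset out of a Content-Type header value using the standard library's email.message.Message class. The helper charset_from_content_type, the header values, and the body bytes are all made up for the example; a real browser does considerably more than this.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the charset named in a Content-Type value, or default if none is given."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default

declared = charset_from_content_type("text/html; charset=UTF-8")
missing = charset_from_content_type("text/html", default="iso-8859-1")
print(declared, missing)

# Hypothetical response body: the UTF-8 bytes for "fährt".
body = bytes([0x66, 0xC3, 0xA4, 0x68, 0x72, 0x74])
print(body.decode(declared))  # the declared encoding gives the right word
print(body.decode(missing))   # falling back to a guess gives the nonsense
```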

ISO-8859-1 interpreted as UTF-8

If we try to interpret the ISO-8859-1 encoding of fährt as if it were UTF-8, we encounter an error. Let's look again at the ISO-8859-1 encoding of fährt, this time as binary:

    f        ä        h        r        t             <-- ISO-8859-1
01100110 11100100 01101000 01110010 01110100
         ^^^^     ^
    f    ok       invalid starting here!              <-- UTF-8

The f is encoded identically in ISO-8859-1 and UTF-8, but the ä causes us problems. Since that byte starts with binary 1110, UTF-8 treats it as the start of a three-byte sequence and expects each of the next two bytes to start with 10. But the very next byte starts with 01, so this is invalid UTF-8.

Whereas fährt was technically valid but semantically incorrect, trying to interpret the byte sequence above as UTF-8 is just plain wrong. It is not valid UTF-8. Unfortunately, programs are faced with this every day due to incorrectly configured web servers or poorly written web pages. Some programs try to guess what to display (i.e. they probably meant fährt), others display question marks, others fail entirely.
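Python happens to offer two of these behaviors, which makes for a handy demonstration. In the sketch below (variable names are mine), a strict decode fails entirely, while decoding with errors="replace" substitutes the Unicode replacement character U+FFFD, which many programs display as a question mark.

```python
# The ISO-8859-1 bytes for "fährt".
latin1_bytes = "f\u00e4hrt".encode("iso-8859-1")

# Strict decoding as UTF-8 fails at the ä byte.
try:
    latin1_bytes.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the first offending byte
print(error_position)

# Lenient decoding swaps the bad byte for U+FFFD instead of failing.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced)
```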

ISO-8859-1 as technically valid UTF-8

There are some circumstances where you could have an ISO-8859-1 string that is valid UTF-8, but represents something different. This is a contrived example, but it makes the point. Take the string ä££. In ISO-8859-1 you have the sequence:

       ä          £          £
11100100   10100011   10100011
   e   4      a   3      a   3

This is also a valid 3-byte sequence in UTF-8:

11100100   10100011   10100011   <-- our UTF-8 data
      |         |          |
      v         v          v
1110xxxx   10yyyyyy   10zzzzzz   <-- the UTF-8 standard
      |         |          |
      v         v          v
----0100   --100011   --100011   <-- the resultant Unicode character
                |
                v
         0100100011100011
            4   8   e   3

which corresponds to a Chinese character (U+48E3). Even though it is valid UTF-8, it means something else entirely.
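Here is the same contrived example as a Python sketch, with variable names of my own choosing. It decodes the three bytes both ways, then repeats the bit extraction from the diagram by hand to confirm the resulting code.

```python
# The three bytes from the example above.
data = bytes([0xE4, 0xA3, 0xA3])

as_latin1 = data.decode("iso-8859-1")  # three separate characters
as_utf8 = data.decode("utf-8")         # one three-byte character

# Strip the 1110 and 10 prefixes and join the remaining 4 + 6 + 6 bits.
code_point = ((data[0] & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)

print(len(as_latin1), len(as_utf8))
print(f"U+{code_point:04x}", f"U+{ord(as_utf8):04x}")
```

Neither decode raises an error, so nothing in the bytes themselves tells you which reading was intended.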

Conclusion

Hopefully these concrete examples help you to understand how character encoding works, and how to track down the problem when it doesn't. UTF-8 is becoming more widely supported every day, but we will still have to deal with legacy encodings for many years to come.

Resources