Oct 30 2006

UTF-8, 16, 32

Published by at 9:21 am under General,Mac,web

In a mailing list post today, I saw the following succinct explanation of the various UTF encodings and how they relate to Unicode.

All UTFs are able to encode exactly the same set of characters. That’s the whole point of Unicode. The UTFs specify the nit-picky details of how to actually represent Unicode text in bytes, but the Unicode text itself is a stream of characters, and it is not affected by the encoding method used.
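As a rough illustration of that point, the sketch below encodes one sample string with three UTFs and decodes each result again. The sample characters and variable names are our own choices, not part of the original post; the codec names are the ones Python's standard library provides.

```python
# Illustrative sample (our choice): 'A', 'ç', DEVANAGARI LETTER A,
# and LINEAR B SYLLABLE B008 A (U+10000).
text = "A\u00e7\u0905\U00010000"

# Encode the same text with each UTF. The explicit "-be" variants
# avoid a byte order mark being prepended to the output.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

# The byte sequences are different for each encoding...
lengths = {name: len(data) for name, data in encoded.items()}

# ...but every one of them decodes back to exactly the same characters.
decoded = {name: data.decode(name) for name, data in encoded.items()}
all_equal = all(value == text for value in decoded.values())
```

Only the bytes change between encodings; the decoded text is identical in every case.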

UTF-8 is an 8-bit encoding only in the sense that it’s defined in terms of bytes. That doesn’t mean that each character takes up only one byte. Depending on the character, it may take up one, two, three, or four bytes.

Example UTF-8 representations:
U+0041 LATIN CAPITAL LETTER A (‘A’): one byte, value 0x41
U+00E7 LATIN SMALL LETTER C WITH CEDILLA (‘ç’): two bytes, values 0xC3 0xA7
U+0905 DEVANAGARI LETTER A (‘अ’): three bytes, values 0xE0 0xA4 0x85
U+10000 LINEAR B SYLLABLE B008 A (‘𐀀’): four bytes, values 0xF0 0x90 0x80 0x80
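The byte values listed above can be checked with a few lines of Python. This is only a sketch: `utf8_hex` is a hypothetical helper name introduced here for illustration.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex values."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))

# The four code points from the examples above.
for code_point in (0x0041, 0x00E7, 0x0905, 0x10000):
    ch = chr(code_point)
    byte_count = len(ch.encode("utf-8"))
    print(f"U+{code_point:04X}: {byte_count} byte(s): {utf8_hex(ch)}")
```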

UTF-16 is defined in terms of 16-bit ‘short words’ (two bytes). That doesn’t mean that each character takes up only one word, however. Depending on the character, it may take up one or two words: characters up to U+FFFF take one, and characters beyond that take two (a surrogate pair). UTF-16 by itself just defines that sequence of 16-bit values; it has nothing to say about how they’re physically stored as bytes.
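To make the one-or-two-words distinction concrete, here is a small sketch that lists the 16-bit values for a character. `utf16_code_units` is a hypothetical helper of our own. The byte-order comparison at the end uses Python's `utf-16-be` and `utf-16-le` codecs, which are one way the byte-level storage gets pinned down; the original post does not go into that detail.

```python
import struct

def utf16_code_units(ch: str) -> list:
    """Return the 16-bit values UTF-16 uses for the given text."""
    data = ch.encode("utf-16-be")          # big-endian, no byte order mark
    count = len(data) // 2                 # two bytes per 16-bit word
    return list(struct.unpack(f">{count}H", data))

one_word = utf16_code_units("A")            # a single 16-bit value
two_words = utf16_code_units("\U00010000")  # a pair of 16-bit values

# The same 16-bit value can be laid out in memory in either byte order.
big_endian = "A".encode("utf-16-be")
little_endian = "A".encode("utf-16-le")
```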

UTF-8 is useful for several reasons, not least of which is that it’s backwards compatible with ASCII: a 7-bit ASCII text file is, without any modifications whatsoever, a perfectly legal UTF-8 text file. UTF-8 is also reasonably compact for Latin-based scripts. It starts losing the size battle to UTF-16 for the scripts of the Subcontinent and the Far East, where most characters take three bytes in UTF-8 but only two in UTF-16. That is why there are things like SCSU (http://www.unicode.org/reports/tr6/) that let you shift the position of the subset of characters that’s representable with single-byte values.
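Both claims, ASCII compatibility and the size trade-off, are easy to check. The sample strings and the `sizes` helper below are hypothetical, chosen only for illustration.

```python
def sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for the text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

# Pure ASCII text produces byte-for-byte identical output in UTF-8.
ascii_text = "plain ASCII text"
ascii_compatible = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Latin-based text: UTF-8 is smaller than UTF-16.
latin_sizes = sizes("fa\u00e7ade")                 # 'façade'

# Devanagari text: UTF-16 is smaller than UTF-8.
devanagari_sizes = sizes("\u0905\u0906\u0907")     # three Devanagari letters
```

For these samples, the Latin string is smaller in UTF-8 and the Devanagari string is smaller in UTF-16, matching the trade-off described above.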
