[Tutor] why is unichr(sys.maxunicode) blank?

Steven D'Aprano steve at pearwood.info
Sat May 18 05:49:38 CEST 2013


On 18/05/13 05:23, Albert-Jan Roskam wrote:

> I was curious what the "high" four-byte ut8 unicode characters look like.

By the way, your sentence above reflects a misunderstanding. Unicode characters (strictly speaking, code points) are not "bytes", four or otherwise. They are abstract entities represented by a number between 0 and 1114111, or in hex, 0x10FFFF. Code points can represent characters, or parts of characters (e.g. accents, diacritics, combining characters and similar), or non-characters.

Much confusion comes from conflating bytes and code points, or bytes and characters. The first step to being a Unicode wizard is to always keep them distinct in your mind. By analogy, the floating point number 23.42 is stored in memory or on disk as a bunch of bytes, but there is nothing to be gained from confusing the number 23.42 with the bytes 0xEC51B81E856B3740, which is how it is stored as a C double.
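
If you want to see that distinction concretely, here is a rough sketch in Python 3 using the struct module to ask for the bytes of a little-endian C double:

>>> import struct
>>> struct.pack('<d', 23.42)     # the float as 8 little-endian IEEE 754 bytes
b'\xecQ\xb8\x1e\x85k7@'
>>> struct.unpack('<d', b'\xecQ\xb8\x1e\x85k7@')[0]     # and back to a number
23.42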

Unicode code points are abstract entities, but in the real world, they have to be stored in a computer's memory, or written to disk, or transmitted over a wire, and that requires *bytes*. So there are three Unicode schemes for storing code points as bytes. These are called *encodings*. Only encodings involve bytes, so it is nonsense to talk about "four-byte" Unicode characters, since doing so conflates the abstract Unicode character set with one of various concrete encodings.

There are three standard Unicode encodings. (These are not to be confused with the dozens of "legacy encodings", a.k.a. code pages, used prior to the Unicode standard. They do not cover the entire range of Unicode, and are not part of the Unicode standard.) These encodings are:

UTF-8
UTF-16
UTF-32 (also sometimes known as UCS-4)

plus at least one older, obsolete encoding, UCS-2.
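
To make that concrete before looking at each one in turn, here is a single code point, U+10FFFF, as bytes in each of the three standard encodings. This is a quick sketch in Python 3, where strings are sequences of code points and encode() returns bytes:

>>> ch = chr(0x10FFFF)          # the highest code point
>>> ch.encode('utf-32-be')
b'\x00\x10\xff\xff'
>>> ch.encode('utf-16-be')
b'\xdb\xff\xdf\xff'
>>> ch.encode('utf-8')
b'\xf4\x8f\xbf\xbf'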

UTF-32 is the least common, but simplest. It simply maps every code point to four bytes. In the following, I will follow this convention:

- code points are written using the standard Unicode notation, U+xxxx where the x's are hexadecimal digits;

- bytes are written in hexadecimal, using a leading 0x.

Code point U+0000 -> bytes 0x00000000
Code point U+0001 -> bytes 0x00000001
Code point U+0002 -> bytes 0x00000002
...
Code point U+10FFFF -> bytes 0x0010FFFF


It is simple because the mapping is trivial, and uncommon because, for typical English-language text, it wastes a lot of memory.

The only complication is that UTF-32 depends on the endianness of your system. In the above examples I glossed over this factor. In fact, there are two common ways that bytes can be stored:

- "big endian", where the most-significant (largest) byte is on the left (lowest address);
- "little endian", where the most-significant (largest) byte is on the right.

So in a little-endian system, we have this instead:

Code point U+0000 -> bytes 0x00000000
Code point U+0001 -> bytes 0x01000000
Code point U+0002 -> bytes 0x02000000
...
Code point U+10FFFF -> bytes 0xFFFF1000

(Note that little-endian is not merely the reverse of big-endian. It is the order of bytes that is reversed, not the order of digits, or the order of bits within each byte.)
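
You can see both orderings from Python 3 by asking for the big-endian and little-endian variants of the codec explicitly:

>>> chr(0x10FFFF).encode('utf-32-be')
b'\x00\x10\xff\xff'
>>> chr(0x10FFFF).encode('utf-32-le')
b'\xff\xff\x10\x00'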

So when you receive a bunch of bytes that you know represents text encoded using UTF-32, you can group the bytes in fours and convert each group to a Unicode code point. But you need to know the endianness. One way to signal that is to add a Byte Order Mark (BOM) at the beginning of the bytes. If the first four bytes look like 0x0000FEFF, you have big-endian UTF-32; if they look like 0xFFFE0000, you have little-endian.
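
Here is a rough sketch of that check in Python 3, using the BOM constants from the codecs module. (The function name utf32_byte_order is just made up for this example, and the final result assumes a little-endian machine, since Python's plain 'utf-32' codec writes a BOM in the machine's native order.)

>>> import codecs
>>> def utf32_byte_order(data):          # name made up for this example
...     if data.startswith(codecs.BOM_UTF32_BE):    # b'\x00\x00\xfe\xff'
...         return 'big-endian'
...     elif data.startswith(codecs.BOM_UTF32_LE):  # b'\xff\xfe\x00\x00'
...         return 'little-endian'
...     else:
...         return 'no BOM, byte order unknown'
...
>>> utf32_byte_order('abc'.encode('utf-32'))
'little-endian'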

So that's UTF-32. UTF-16 is a little more complicated.

UTF-16 divides the Unicode range into two groups:

* The first (approximately) 65000 code points, which are represented as two bytes each;

* Everything else, which is represented as a pair of double bytes, a so-called "surrogate pair".

For the first 65000-odd code points, the mapping is trivial, and relatively compact:

code point U+0000 => bytes 0x0000
code point U+0001 => bytes 0x0001
code point U+0002 => bytes 0x0002
...
code point U+FFFF => bytes 0xFFFF


Code points beyond that point are encoded into a pair of double bytes (four bytes in total):

code point U+10000 => bytes 0xD800 DC00
...
code point U+10FFFF => bytes 0xDBFF DFFF
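
Again in Python 3, you can watch the encoding switch from two bytes to a surrogate pair as the code point crosses U+FFFF:

>>> '\uffff'.encode('utf-16-be')
b'\xff\xff'
>>> '\U00010000'.encode('utf-16-be')
b'\xd8\x00\xdc\x00'
>>> '\U0010ffff'.encode('utf-16-be')
b'\xdb\xff\xdf\xff'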


Notice a potential ambiguity here. If you receive the double byte 0xD800, is that the start of a surrogate pair, or the code point U+D800? The Unicode standard resolves this ambiguity by officially reserving code points U+D800 through U+DFFF for use as surrogates in UTF-16.
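
That reservation is also why Python 3 refuses to encode a lone surrogate on its own:

>>> chr(0xD800).encode('utf-16-be')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\ud800' in position 0: surrogates not allowed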

Like UTF-32, UTF-16 also has to distinguish between big-endian and little-endian. It does so with a leading BOM, only this time it is two bytes, not four:

0xFEFF => big-endian
0xFFFE => little-endian
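
In Python 3, the plain 'utf-16' codec adds that BOM for you, in your machine's native byte order, while the explicitly-named variants do not; the first result below is what a little-endian machine produces:

>>> 'A'.encode('utf-16')
b'\xff\xfeA\x00'
>>> 'A'.encode('utf-16-be')
b'\x00A'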


Last but not least, we have UTF-8. UTF-8 is slowly becoming the standard for storing Unicode on disk, because it is very compact for common English-language text, backwards-compatible with ASCII text files, and doesn't require a BOM. (Although Microsoft software sometimes adds a UTF-8 signature at the start of files, namely the three bytes 0xEFBBBF.)
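
Python 3 exposes that signature as codecs.BOM_UTF8, and its 'utf-8-sig' codec writes it when encoding and strips it when decoding:

>>> import codecs
>>> codecs.BOM_UTF8
b'\xef\xbb\xbf'
>>> 'hi'.encode('utf-8-sig')
b'\xef\xbb\xbfhi'
>>> b'\xef\xbb\xbfhi'.decode('utf-8-sig')
'hi'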

UTF-8 is also a variable-width encoding. Unicode code points are mapped to one, two, three or four bytes, as needed:

Code points U+0000 to U+007F => 1 byte
Code points U+0080 to U+07FF => 2 bytes
Code points U+0800 to U+FFFF => 3 bytes
Code points U+10000 to U+10FFFF => 4 bytes

(Older versions of UTF-8 could go up to six bytes, but now that Unicode is officially limited to a maximum code point of U+10FFFF, it only goes up to four bytes.)
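
You can check those widths for yourself in Python 3 by encoding a sample code point from each range:

>>> for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
...     print('U+%04X -> %d byte(s)' % (cp, len(chr(cp).encode('utf-8'))))
...
U+0041 -> 1 byte(s)
U+007F -> 1 byte(s)
U+0080 -> 2 byte(s)
U+07FF -> 2 byte(s)
U+0800 -> 3 byte(s)
U+FFFF -> 3 byte(s)
U+10000 -> 4 byte(s)
U+10FFFF -> 4 byte(s)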



-- 
Steven

