[Tutor] why is unichr(sys.maxunicode) blank?

Steven D'Aprano steve at pearwood.info
Sat May 18 14:12:16 CEST 2013


On 18/05/13 20:01, Albert-Jan Roskam wrote:

> Thanks for all your replies. I knew about code points, but to represent the unicode string (code point) as a utf-8 byte string (bytes), characters 0-127 are 1 byte (of 8 bits), then 128-255 (accented chars)
> are 2 bytes, and so on up to 4 bytes for East Asian languages. But later on Joel Spolsky's "standard" page about unicode I read that it goes to 6 bytes. That's what I implied when I mentioned "utf8".

The UTF-8 encoding was originally designed to go up to 6 bytes per character, but since Unicode itself is limited to code points up to U+10FFFF (1,114,112 of them in all), no more than 4 bytes are ever needed for UTF-8.
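For instance, on Python 3.3 or a wide build (where sys.maxunicode is the full 0x10FFFF), even the largest code point fits in four UTF-8 bytes:

py> import sys
py> hex(sys.maxunicode)
'0x10ffff'
py> len(chr(sys.maxunicode).encode('utf-8'))
4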

Also, it is wrong to say that the 4-byte UTF-8 values are "East Asian languages". The full Unicode range contains 17 "planes" of 65,536 code points each. The first such plane is called the "Basic Multilingual Plane", and it includes all the code points that can be represented in 1 to 3 UTF-8 bytes. The BMP includes in excess of 13,000 East Asian code points, e.g.:


py> import unicodedata as ud
py> c = '\u3050'
py> print(c, ud.name(c), c.encode('utf-8'))
ぐ HIRAGANA LETTER GU b'\xe3\x81\x90'


The 4-byte UTF-8 values are in the second and subsequent planes, known as the supplementary planes (the first of them is the "Supplementary Multilingual Plane"). They include historical scripts such as Egyptian hieroglyphs, cuneiform, Old Persian and Old South Arabian, as well as musical and mathematical symbols, Emoji, gaming symbols, and many others.

http://en.wikipedia.org/wiki/Plane_(Unicode)
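Continuing the example above, here is one of those gaming symbols, which needs the full four UTF-8 bytes (this assumes Python 3.3 or a wide build, so the code point is a single character rather than a surrogate pair):

py> c = '\U0001F3B2'
py> print(c, ud.name(c), c.encode('utf-8'))
🎲 GAME DIE b'\xf0\x9f\x8e\xb2'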


> I always viewed the codepage as "the bunch of chars on top of ascii", e.g. cp1252 (latin-1) is ascii (0-127) +  another 128 characters that are used in Europe (euro sign, Scandinavian and Mediterranean (Spanish), but not Slavian chars).

Well, that's certainly common, but not all legacy encodings are supersets of ASCII. For example:

http://en.wikipedia.org/wiki/Big5

although I see that Python's implementation of Big5 is *technically* incorrect but *practically* useful, in that it does include ASCII.
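You can see that for yourself (Python 3 syntax): plain ASCII text survives a round trip through Python's big5 codec unchanged:

py> 'hello world'.encode('big5')
b'hello world'
py> b'hello world'.decode('big5')
'hello world'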


> A certain locale implies a certain codepage (on Windows), but where does the locale category LC_CTYPE fit in this story?

No idea :-)




>> UTF-8
>> UTF-16
>> UTF-32 (also sometimes known as UCS-4)
>>
>> plus at least one older, obsolete encoding, UCS-2.
>
> Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)? Or maybe this is a different abbreviation. I read about bit multilingual plane (BMP) and surrogate pairs and all. The author suggested that messing with surrogate pairs is a topic to dive into in case one's nail bed is being derusted. I wholeheartedly agree.

UCS-2 is a fixed-width encoding that is identical to UTF-16 for code points up to U+FFFF. It differs from UTF-16 in that it *cannot* encode code points U+10000 and higher; in other words, it does not support surrogate pairs. So UCS-2 is obsolete in the sense that it cannot represent the whole set of Unicode characters.
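For example, a code point above U+FFFF needs a surrogate pair in UTF-16, something UCS-2 simply has no way to express (Python 3 syntax):

py> '\U00010000'.encode('utf-16-be')  # encoded as the surrogate pair D800+DC00
b'\xd8\x00\xdc\x00'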

In Python 3.2 and older, Python has a choice between a *narrow build* that uses UTF-16 (including surrogates) for strings in memory, or a *wide build* that uses UTF-32. The choice is made when you compile the Python interpreter. Other programming languages may use other systems.

Python 3.3 uses a different, more flexible scheme for keeping strings in memory. Depending on the largest code point in a string, the string will be stored in either Latin-1 (one byte per character), UCS-2 (two bytes per character, and no surrogates) or UTF-32 (four bytes per character). This means that there is no longer a need for surrogate pairs, but only strings that *need* four bytes per character will use four bytes.
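You can get a rough sense of this with sys.getsizeof -- the exact byte counts vary with the Python version and platform, but on 3.3 and later the ordering holds:

py> import sys
py> sys.getsizeof('a'*1000) < sys.getsizeof('\u0400'*1000) < sys.getsizeof('\U00010400'*1000)
True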



>> - "big endian", where the most-significant (largest) byte is on the left (lowest address);
>> - "little endian", where the most-significant (largest) byte is on the right.
>
>
> Why is endianness relevant only for utf-32, but not for utf-8 and utf16? Is "utf-8" a shorthand for saying "utf-8 le"?

Endianness is relevant for UTF-16 too.

It is not relevant for UTF-8 because UTF-8 defines the order in which the bytes of a multi-byte sequence must appear. UTF-8 is defined in terms of *bytes*, not multi-byte words. So the code point U+3050 is encoded into three bytes, *in this order*:

0xE3 0x81 0x90

There's no question about which byte comes first, because the order is set. But UTF-16 defines the encoding in terms of double-byte words, so the question of how words are stored becomes relevant. A 16-bit word can be laid out in memory in at least two ways:

[most significant byte] [least significant byte]

[least significant byte] [most significant byte]

so U+3050 could legitimately appear as bytes 0x3050 or 0x5030 depending on the machine you are using.
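You can see both layouts by asking Python for an explicit byte order (handily, 0x30 and 0x50 are the codes for the ASCII characters '0' and 'P', so the bytes are visible in the repr):

py> '\u3050'.encode('utf-16-be')   # bytes 0x30 0x50
b'0P'
py> '\u3050'.encode('utf-16-le')   # bytes 0x50 0x30
b'P0'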

It's hard to talk about endianness without getting confused, or at least for me it is :-) Even though I've written down 0x3050 and 0x5030, it is important to understand that they both have the same numeric value of 12368 in decimal. The difference is just in how the bytes are laid out in memory. By analogy, Arabic numerals used in English and other Western languages are written in *big endian order*:

1234 means 1 THOUSAND 2 HUNDREDS 3 TENS 4 UNITS

Imagine a language that wrote numbers in *little endian order*, but using the same digits. You would count:

0
1
2
...
01  # no UNITS 1 TEN
11  # 1 UNITS 1 TEN
21  # 2 UNITS 1 TEN
...
4321  # 4 UNITS 3 TENS 2 HUNDREDS 1 THOUSAND


Since both UTF-16 and UTF-32 are defined in terms of 16 or 32 bit words, endianness is relevant; since UTF-8 is defined in terms of 8-bit bytes, it is not.

Fortunately, all(?) modern computing hardware has standardized on the same "endianness" of individual bytes. This was not always the case, but today if you receive a byte with bits:

0b00110000

then there is no(?) doubt that it represents decimal 48, not 12.
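And 48, as it happens, is the code for the digit zero:

py> 0b00110000
48
py> chr(48)
'0'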



>> So when you receive a bunch of bytes that you know represents text encoded using UTF-32, you can bunch the bytes in groups of four and convert them to Unicode code points. But you need to know the endianess. One way to do that is to add a Byte Order Mark at the beginning of the bytes. If you look at the first four bytes, and it looks like 0x0000FEFF, then you have big-endian UTF-32. But if it looks like 0xFFFE0000, then you have little-endian.
>
> So each byte starts with a BOM? Or each file? I find utf-32 indeed the easiest to understand.

Certainly not each byte! That would be impossible, since the BOM itself is *two bytes* for UTF-16 and *four bytes* for UTF-32.

Remember, a BOM is not compulsory. If you decide beforehand that you will always use big-endian UTF-16, say, there is no need to waste time with a BOM. But then you're responsible for producing big-endian words even if your hardware is little-endian.

A BOM is useful when you're transmitting a file to somebody else, and they *might* not have the same endianness as you. If you can pass a message on via some other channel, you can say "I'm about to send you a file in little-endian UTF-16" and all will be good. But since you normally can't, you just insert the BOM at the start of the file, and they can auto-detect the endianness.

How do they do that? They read the first two bytes. If those read as 0xFFFE, that tells them that their byte-order and my byte-order are mismatched, and they should use the opposite byte-order from their system's default. If they read as 0xFEFF, our byte-orders match, and we're good to go.

You can stick a BOM at the beginning of every string, but that's rather wasteful, and it leads to difficulty with string processing (especially concatenating strings), so it's best not to use BOMs except *at most* once per file.
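Python's own codecs illustrate the difference: the plain "utf-16" codec writes a BOM in the machine's native order, while the codecs with an explicit byte order do not (the first result below is what you would see on a little-endian machine):

py> 'abc'.encode('utf-16')      # native order, BOM prepended
b'\xff\xfea\x00b\x00c\x00'
py> 'abc'.encode('utf-16-be')   # explicit order, no BOM
b'\x00a\x00b\x00c'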


> In utf-8, how does a system "know" that the given octet of bits is to be interpreted as a single-byte character, or rather like "hold on, these eight bits are gibberish as they are right now, let's check what happens if we add the next eight bits", in other words a multibyte char (forgive me the naive phrasing ;-). Why I mention is in the context of BOM: why aren't these needed to indicate "mulitbyte char ahead!"?

Because UTF-8 is a very cunning system that was designed by very clever people (Dave Prosser and Ken Thompson) to be unambiguous when read one byte at a time.

When reading a stream of UTF-8 bytes, you look at the first bit of the current byte. If it is a zero, then you have a single-byte code, so you can decode that byte and move on to the next byte. A single byte with a leading 0 gives you 128 possible different values. (If this sounds like ASCII, that's not a coincidence.)

But if the current byte starts with bits 110, then you throw those three bits away, and keep the next five bits. Then you read the next byte, check that it starts with bits 10, and keep the six bits following that. That gives you 5+6 = 11 useful bits in total, from two bytes read, which is enough to encode a further 2048 distinct values.

If the current byte starts with bits 1110, then you throw those four bits away and keep the next four. Then you read in two more bytes, check that they both start with bits 10, and keep the next six bits from each. This gives you 4+6+6 = 16 bits in total, which encodes a further 65536 values.

If the current byte starts with 11110, you throw away those five bits and read in the next three bytes. This gives you 3+6+6+6 = 21 bits, which is enough to encode 2097152 values. So in total, that gives you 128+2048+65536+2097152 = 2164864 distinct values, which is more than the 1,114,112 code points we actually need.

(Notice that the number of leading 1s in the first byte tells you how many bytes you need to read. Also note that not all byte sequences are valid UTF-8.) In summary:

U+0000 - U+007F => 0xxxxxxx
U+0080 - U+07FF => 110xxxxx 10xxxxxx
U+0800 - U+FFFF => 1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+1FFFFF => 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
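You can check those boundaries from Python by encoding code points on either side of them; the number after each code point is how many UTF-8 bytes it needs (again, Python 3.3 or a wide build, so that '\U00010000' is a single character):

py> for c in '\u007f\u0080\u07ff\u0800\uffff\U00010000':
...     print('U+%04X' % ord(c), len(c.encode('utf-8')))
...
U+007F 1
U+0080 2
U+07FF 2
U+0800 3
U+FFFF 3
U+10000 4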



>> So that's UTF-32. UTF-16 is a little more complicated.
>>
>> UTF-16 divides the Unicode range into two groups:
>>
>> * The first (approximately) 65000 code points which are represented as two bytes;
>>
>> * Everything else, which are represented as a pair of double bytes, so-called "surrogate pairs".
>
>
> Just as I thought I was starting to understand it.... Sorry. len(unichr(63000).encode("utf-8")) returns three bytes.

You're using UTF-8. I'm talking about UTF-16.
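In Python 3, where chr() does what Python 2's unichr() did, you can see the difference: code point 63000 is below U+FFFF and so takes just two bytes in UTF-16, while anything above U+FFFF needs a surrogate pair:

py> len(chr(63000).encode('utf-16-be'))
2
py> len(chr(0x10000).encode('utf-16-be'))
4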





-- 
Steven

