[Tutor] how to struct.pack a unicode string?
Steven D'Aprano
steve at pearwood.info
Tue Jan 1 07:29:48 CET 2013
I'm digging out an old email which I saved as a draft almost a month ago
but never got around to sending, because I think the new Unicode
implementation in Python 3.3 is one of the coolest things ever.
On 03/12/12 16:56, eryksun wrote:
> CPython 3.3 has a new implementation that angles for the best of all
> worlds, opting for a 1-byte, 2 byte, or 4-byte representation
> depending on the maximum code in the string. The internal
> representation doesn't use surrogates, so there's no more narrow vs
> wide build distinction.
The consequences of this may not be clear to some people. Here's the
short version:
The full range of 1114112 Unicode code points (informally "characters")
does not fit in two bytes. Two bytes can cover the values 0 through
65535 (0xFFFF in hexadecimal), while Unicode code points go up to
1114111 (0x10FFFF). So what to do? There are three obvious solutions:
1 If you store each character using four bytes, you can cover the
entire Unicode range. The downside is that for English speakers and
ASCII users, strings will use four times as much memory as you
expect: e.g. the character 'A' will be stored as 0x00000041 (four
bytes instead of one in pure ASCII).
When you compile the Python interpreter, you can set an option to
do this. This is called a "wide" build.
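To make the wide-build cost concrete (my own sketch, not part of the
original discussion): encoding a string as UTF-32 shows the same
four-bytes-per-character layout a wide build stores internally:

```python
# 'A' is code point 0x41; UTF-32 ("wide") storage pads it to four bytes.
data = 'A'.encode('utf-32-be')   # big-endian, no byte-order mark
print(data.hex(), len(data))     # 00000041 4
```

Three of the four bytes are zero for every ASCII character, which is
exactly the memory waste described above.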
2 Since "wide builds" use so much extra memory for the average ASCII
string, hardly anyone uses them. Instead, the default setting for
Python is a "narrow" build: characters use only two bytes, which is
enough for the most common characters. E.g. the character 'A' will
be stored as 0x0041.
The less common characters can't be represented in a single two-
byte unit, so Unicode defines a *pair of two-byte units* to
encode the extra (hopefully rare) characters. These are called
"surrogate pairs". For example, Unicode code point 0x10859 is too
large for a pair of bytes. So in Python 3.2, you get this:
py> c = chr(0x10859) # IMPERIAL ARAMAIC NUMBER TWO
py> print(len(c), [hex(ord(x)) for x in c])
2 ['0xd802', '0xdc59']
Notice that instead of getting a single character, you get two
characters. Your software is then supposed to manually check for
such surrogate pairs. Unfortunately nobody does, because that's
complicated and slow, so people end up with code that cannot handle
strings with surrogate pairs safely. It's easy to break the pair up
and get invalid strings that don't represent any actual character.
In other words, Python *wide builds* use too much memory, and
*narrow builds* are buggy and let you break strings. Oops.
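For the curious, the arithmetic behind surrogate pairs is simple
enough to write out by hand (a sketch; `to_surrogate_pair` is my own
name, not a standard function):

```python
def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10859)])
# ['0xd802', '0xdc59'] -- the same pair the 3.2 example shows
```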
3 Python 3.3 takes a third option: when you create a string object,
the interpreter analyses the string, works out the largest character
used, and only then decides how many bytes per character to use.
So in Python 3.3, the decision to use "wide" strings (4 bytes per
character) or "narrow" strings (2 bytes) is no longer made when
compiling the Python interpreter. It is made per string, with the
added bonus that purely ASCII or Latin1 strings can use 1 byte
per character. That means, no more surrogate pairs, and every
Unicode character is now a single character:
py> c = chr(0x10859) # Python 3.3
py> print(len(c), [hex(ord(x)) for x in c])
1 ['0x10859']
and a good opportunity for large memory savings.
How big are the memory savings? They can be substantial. Purely Latin1
strings (so-called "extended ASCII") can be close to half the size of a
narrow build:
[steve at ando ~]$ python3.2 -c "import sys; print(sys.getsizeof('ñ'*1000))"
2030
[steve at ando ~]$ python3.3 -c "import sys; print(sys.getsizeof('ñ'*1000))"
1037
I don't have a wide build to test, but the size would be roughly twice as
big again, about 4060 bytes.
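You can watch the per-string decision directly with `sys.getsizeof` on
Python 3.3 or later (a sketch; the exact byte counts vary by Python
version and platform, so treat the printed numbers as illustrative):

```python
import sys

# One thousand copies of a character from each range; on Python 3.3+
# the per-character storage is 1, 1, 2 and 4 bytes respectively.
for ch in ('A',            # ASCII            -> 1 byte/char
           '\xf1',         # Latin-1 'ñ'      -> 1 byte/char
           '\u20ac',       # BMP '€'          -> 2 bytes/char
           '\U00010859'):  # beyond the BMP   -> 4 bytes/char
    s = ch * 1000
    print(hex(ord(ch)), sys.getsizeof(s))
```

Each step up in the largest code point roughly doubles the string's
size, which is the per-string "narrow vs wide" choice in action.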
But more important than the memory savings, it means that for the first
time Python's handling of Unicode strings is correct for the entire range
of all one million plus characters, not just the first 65 thousand.
And that, I think, is a really important step. All we need now is better
fonts that support more of the Unicode range so we can actually *see* the
characters.
--
Steven