[Tutor] how to struct.pack a unicode string?
Steven D'Aprano
steve at pearwood.info
Tue Jan 1 07:29:48 CET 2013
I'm digging out an old email which I saved as a draft almost a month ago
but never got around to sending, because I think the new Unicode
implementation in Python 3.3 is one of the coolest things ever.
On 03/12/12 16:56, eryksun wrote:
> CPython 3.3 has a new implementation that angles for the best of all
> worlds, opting for a 1-byte, 2 byte, or 4-byte representation
> depending on the maximum code in the string. The internal
> representation doesn't use surrogates, so there's no more narrow vs
> wide build distinction.
The consequences of this may not be clear to some people. Here's the
short version:
The full range of 1114112 Unicode code points (informally "characters")
does not fit in two bytes. Two bytes can cover the values 0 through
65535 (0xFFFF in hexadecimal), while Unicode code points go up to
1114111 (0x10FFFF). So what to do? There are three obvious solutions:
1 If you store each character using four bytes, you can cover the
entire Unicode range. The downside is that for English speakers and
ASCII users, strings will use four times as much memory as you
expect: e.g. the character 'A' will be stored as 0x00000041 (four
bytes instead of one in pure ASCII).
When you compile the Python interpreter, you can set an option to
do this. This is called a "wide" build.
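To make the wide-build cost concrete (my own sketch, not part of the
original discussion): encoding a string as UTF-32 shows the same
four-bytes-per-character layout a wide build stores internally:

```python
# 'A' is code point 0x41; UTF-32 ("wide") storage pads it to four bytes.
data = 'A'.encode('utf-32-be')   # big-endian, no byte-order mark
print(data.hex(), len(data))     # 00000041 4
```

Three of the four bytes are zero for every ASCII character, which is
exactly the memory waste described above.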
2 Since "wide builds" use so much extra memory for the average ASCII
string, hardly anyone uses them. Instead, the default setting for
Python is a "narrow" build: characters use only two bytes, which is
enough for the most common characters. E.g. the character 'A' will
be stored as 0x0041.
The less common characters can't be represented in a single two-
byte unit, so Unicode defines a *pair of two-byte units* to
encode the extra (hopefully rare) characters. These are called
"surrogate pairs". For example, Unicode code point 0x10859 is too
large for a pair of bytes. So in Python 3.2, you get this:
py> c = chr(0x10859) # IMPERIAL ARAMAIC NUMBER TWO
py> print(len(c), [hex(ord(x)) for x in c])
2 ['0xd802', '0xdc59']
Notice that instead of getting a single character, you get two
characters. Your software is then supposed to manually check for
such surrogate pairs. Unfortunately nobody does, because that's
complicated and slow, so people end up with code that cannot handle
strings with surrogate pairs safely. It's easy to break the pair up
and get invalid strings that don't represent any actual character.
In other words, Python *wide builds* use too much memory, and
*narrow builds* are buggy and let you break strings. Oops.
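For the curious, the arithmetic behind surrogate pairs is simple
enough to write out by hand (a sketch; `to_surrogate_pair` is my own
name, not a standard function):

```python
def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10859)])
# ['0xd802', '0xdc59'] -- the same pair the 3.2 example shows
```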
3 Python 3.3 takes a third option: when you create a string object,
the interpreter analyses the string, works out the largest character
used, and only then decides how many bytes per character to use.
So in Python 3.3, the decision to use "wide" strings (4 bytes per
character) or "narrow" strings (2 bytes) is no longer made when
compiling the Python interpreter. It is made per string, with the
added bonus that purely ASCII or Latin1 strings can use 1 byte
per character. That means, no more surrogate pairs, and every
Unicode character is now a single character:
py> c = chr(0x10859) # Python 3.3
py> print(len(c), [hex(ord(x)) for x in c])
1 ['0x10859']
and a good opportunity for large memory savings.
How big are the memory savings? They can be substantial. Purely Latin1
strings (so-called "extended ASCII") can be close to half the size of a
narrow build:
[steve at ando ~]$ python3.2 -c "import sys; print(sys.getsizeof('ñ'*1000))"
2030
[steve at ando ~]$ python3.3 -c "import sys; print(sys.getsizeof('ñ'*1000))"
1037
I don't have a wide build to test, but the size would be roughly twice as
big again, about 4060 bytes.
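You can watch the per-string decision directly with `sys.getsizeof` on
Python 3.3 or later (a sketch; the exact byte counts vary by Python
version and platform, so treat the printed numbers as illustrative):

```python
import sys

# One thousand copies of a character from each range; on Python 3.3+
# the per-character storage is 1, 1, 2 and 4 bytes respectively.
for ch in ('A',            # ASCII            -> 1 byte/char
           '\xf1',         # Latin-1 'ñ'      -> 1 byte/char
           '\u20ac',       # BMP '€'          -> 2 bytes/char
           '\U00010859'):  # beyond the BMP   -> 4 bytes/char
    s = ch * 1000
    print(hex(ord(ch)), sys.getsizeof(s))
```

Each step up in the largest code point roughly doubles the string's
size, which is the per-string "narrow vs wide" choice in action.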
But more important than the memory savings, it means that for the first
time Python's handling of Unicode strings is correct for the entire range
of all one million plus characters, not just the first 65 thousand.
And that, I think, is a really important step. All we need now is better
fonts that support more of the Unicode range so we can actually *see* the
characters.
--
Steven