[Python-Dev] PEP-393/PEP-3118: unicode format specifiers

Victor Stinner victor.stinner at gmail.com
Tue Mar 6 17:43:43 CET 2012


> In the array module the 'u' specifier previously meant "2-bytes, on wide
> builds 4-bytes". Currently in 3.3 the 'u' specifier is mapped to UCS4.
>
> I think it would be nice for Python3.3 to implement the PEP-3118
> suggestion:
>
> 'c' -> UCS1
>
> 'u' -> UCS2
>
> 'w' -> UCS4

A Unicode string is an array of code points. Another approach is to
expose such a string as an array of uint8/uint16/uint32 integers. I
don't know whether you expect to get a character or a substring when
you read the buffer of a string object. Using Python 3.2, I get:

>>> memoryview(b"abc")[0]
b'a'

... but using Python 3.3 I get a number :-)

>>> memoryview(b'abc')[0]
97
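
The "array of integers" approach can be sketched with the existing
buffer machinery. This is only an illustration, not the proposed 'u'
format: it assumes the text is limited to the BMP (so UTF-16 code units
equal UCS2 values), that the native 'H' format is 2 bytes wide, and it
uses memoryview.cast(), which is available from Python 3.3.

```python
import sys

text = "abc"

# Encode in native byte order so the native 'H' format reads the values back.
codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
raw = text.encode(codec)

# View the bytes as 2-byte unsigned integers (one item per code unit).
units = memoryview(raw).cast("H")

print(units[0])        # an integer, not a one-character string
print(units.tolist())  # all code units as integers
```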

It is no longer possible to create a Unicode string containing
characters outside the U+0000-U+10FFFF range. You might apply the same
restriction in the buffer API for UCS4. If checking there is too
inefficient, the check can be done when you convert the buffer to a
string.

> Actually we could even add 'a' -> ASCII

ASCII implies that the values are in the range U+0000-U+007F (0-127).
As with UCS4, you can do the check either in the buffer API or when the
buffer is converted to a string.

I don't think that it would be useful to add an ASCII buffer type,
because when the buffer is converted to a string, Python has to
recompute the maximum character anyway (to choose between ASCII, UCS1,
UCS2 and UCS4). For example, "abc\xe9"[:-1] is ASCII. UCS1 is enough
for ASCII strings.
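
The recomputation can be sketched in pure Python. The function name
narrowest_kind is hypothetical and this is not how CPython implements
it internally; it only shows that the storage kind follows from the
maximum code point of the result, not from the buffer type it came
from.

```python
def narrowest_kind(text):
    """Name the narrowest storage able to hold every code point in text."""
    max_cp = max(map(ord, text), default=0)
    if max_cp < 0x80:
        return "ASCII"
    if max_cp < 0x100:
        return "UCS1"
    if max_cp < 0x10000:
        return "UCS2"
    return "UCS4"


print(narrowest_kind("abc\xe9"))       # needs UCS1 because of U+00E9
print(narrowest_kind("abc\xe9"[:-1]))  # the slice drops U+00E9
```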

Victor

