[Python-Dev] PEP-393/PEP-3118: unicode format specifiers

Wed Mar 7 01:17:25 CET 2012

On Wed, Mar 7, 2012 at 4:15 AM, Stefan Krah <stefan at bytereef.org> wrote:
> Victor Stinner <victor.stinner at gmail.com> wrote:
>> A Unicode string is an array of code points. Another approach is to
>> expose such string as an array of uint8/uint16/uint32 integers. I
>> don't know if you expect to get a character / a substring when you
>> read the buffer of a string object. Using Python 3.2, I get:
>>
>> >>> memoryview(b"abc")[0]
>> b'a'
>>
>> ... but using Python 3.3 I get a number :-)
>
> Yes, that's changed because officially (see struct module) the format
> is unsigned bytes, which are integers in struct module syntax:
>
> >>> unsigned_bytes = memoryview(b"abc")
> >>> unsigned_bytes.format
> 'B'
> >>> char_array = unsigned_bytes.cast('c')
> >>> char_array.format
> 'c'
> >>> char_array[0]
> b'a'

To maintain backwards compatibility, we should probably take the
purity hit and officially change the default format of memoryview() to
'c', requiring an explicit cast to 'B' to get the new, more bytes-like
behaviour (indexing returns an integer).

Using 'c' as the default format is a little ugly, but not as ugly as
breaking currently working 3.2 code in the upgrade to 3.3.
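
For reference, a minimal sketch of the difference under discussion,
assuming Python 3.3 or later where the default format is 'B'. The
variable names are illustrative only. Slicing and converting to bytes
gives the same result regardless of the format, so it may be one way
to write code that does not depend on the default:

```python
# Assumes Python 3.3+, where memoryview(bytes) reports format 'B'.
view = memoryview(b"abc")

# With format 'B', indexing yields an integer, like indexing bytes.
first_as_int = view[0]

# Casting to 'c' restores the 3.2-style result: a length-1 bytes object.
char_view = view.cast('c')
first_as_bytes = char_view[0]

# Slicing returns a memoryview; converting it to bytes does not depend
# on whether the format is 'B' or 'c'.
portable_from_b = bytes(view[0:1])
portable_from_c = bytes(char_view[0:1])

print(view.format, first_as_int)
print(char_view.format, first_as_bytes)
print(portable_from_b, portable_from_c)
```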

> Possibly the uint8/uint16/uint32 integer approach that you mention
> would make more sense.

Any changes made in this area should be aimed specifically at making
life easier for developers dealing with ASCII puns in binary
protocols. Being able to ask a string for a memoryview, and receiving
one back with the format set to the appropriate value could
potentially help with that by indicating:

ASCII: each code point is mapped to an integer in the range 0-127
latin-1: each code point is mapped to an integer in the range 0-255
UCS2: each code point is mapped to an integer in the range 0-65535
UCS4: each code point is mapped to an integer in the range 0-0x10FFFF

Using the actual code point values, rather than byte representations
that may vary in length, can help gloss over the differences in the
underlying data layout. However, use cases should be explored more
thoroughly *first* before any additional changes are made to the
supported formats.
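
To make the mapping above concrete, here is a rough sketch of what
"code points as integers" could look like. It is purely illustrative:
code_point_view is a hypothetical helper, not an existing or proposed
API, and str does not export a buffer today. The sketch copies the
code points into an array.array and takes a memoryview of that, with
the typecodes 'B', 'H' and 'L' chosen as assumptions to hold the four
ranges:

```python
from array import array


def code_point_view(text):
    """Hypothetical helper: return (label, memoryview of code points).

    The label names the narrowest of the four ranges listed above that
    can hold every code point in `text`.
    """
    highest = max(map(ord, text), default=0)
    if highest < 0x80:
        label, typecode = "ASCII", "B"    # 0-127
    elif highest < 0x100:
        label, typecode = "latin-1", "B"  # 0-255
    elif highest < 0x10000:
        label, typecode = "UCS2", "H"     # 0-65535
    else:
        label, typecode = "UCS4", "L"     # 0-0x10FFFF ('L' is >= 4 bytes)
    # This copies the data; it does not expose the string's own storage.
    return label, memoryview(array(typecode, map(ord, text)))


for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
    label, view = code_point_view(sample)
    print(label, view.format, view.tolist())
```

Each element is one code point whatever the label, so consumers index
the view the same way in all four cases; only the item size differs.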

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia