[Python-Dev] PEP-393/PEP-3118: unicode format specifiers

Stefan Krah stefan at bytereef.org
Tue Mar 6 19:15:16 CET 2012


Victor Stinner <victor.stinner at gmail.com> wrote:
> > 'c' -> UCS1
> > 'u' -> UCS2
> > 'w' -> UCS4
> 
> A Unicode string is an array of code points. Another approach is to
> expose such a string as an array of uint8/uint16/uint32 integers. I
> don't know if you expect to get a character / a substring when you
> read the buffer of a string object. Using Python 3.2, I get:
> 
> >>> memoryview(b"abc")[0]
> b'a'
> 
> ... but using Python 3.3 I get a number :-)

Yes, that changed because officially (see the struct module) the format
is unsigned bytes ('B'), and those are integers in struct module syntax.
Casting to 'c' gives back the old one-byte bytes objects:

>>> unsigned_bytes = memoryview(b"abc")
>>> unsigned_bytes.format
'B'
>>> char_array = unsigned_bytes.cast('c')
>>> char_array.format
'c'
>>> char_array[0]
b'a'


Possibly the uint8/uint16/uint32 integer approach that you mention
would make more sense.
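
A rough sketch of what the integer approach could look like from the
Python side. Note that str does not export a buffer, so this only
emulates fixed-width storage by encoding the text first; the names
ucs1/ucs2/ucs4 and the choice of 'B'/'H'/'I' as formats are assumptions
for illustration, not a proposal for the actual exporter:

```python
import sys

text = "abc"

# str objects do not support the buffer protocol, so emulate the three
# fixed-width layouts by encoding in native byte order (no BOM is added
# by the explicit -le/-be codecs).
suffix = "le" if sys.byteorder == "little" else "be"

ucs1 = memoryview(text.encode("latin-1"))                     # format 'B'
ucs2 = memoryview(text.encode("utf-16-" + suffix)).cast("H")  # uint16 items
ucs4 = memoryview(text.encode("utf-32-" + suffix)).cast("I")  # uint32 items

# Every view yields plain integers, one per code point.
code_points = [ord(ch) for ch in text]
print(ucs1.format, ucs1.tolist())
print(ucs2.format, ucs2.tolist())
print(ucs4.format, ucs4.tolist())
```

With this reading, indexing any of the three views gives an int, which
is consistent with what memoryview(b"abc")[0] returns in 3.3.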


Stefan Krah



