[Python-Dev] PEP 393: Flexible String Representation
stefan_ml at behnel.de
Sat Jan 29 07:33:54 CET 2011
"Martin v. Löwis", 28.01.2011 22:49:
> And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8
> representation for such a loop. Instead, it should access the str
>> Regarding Cython specifically, the above will still be *possible* under
>> the proposal, given that the memory layout of the strings will still
>> represent the Unicode code points. It will just be trickier to implement
>> in Cython's type system as there is no longer a (user visible) C type
>> representation for those code units.
> There is: Py_UCS4 remains available.
Thanks for that pointer. I had always thought that all "*UCS4*" names were
platform specific and had completely missed that type. It's a lot nicer
than Py_UNICODE because it allows users to fold surrogate pairs back into
the character value.
It's completely missing from the docs, BTW. Google doesn't give me a single
mention for all of docs.python.org, even though the type has existed at
least since Python 2.3, Cython's oldest supported runtime (and likely long
before that).
If I had known about that type earlier, I could have ended up making that
the native Unicode character type in Cython instead of bothering with
Py_UNICODE. But I think this can still be changed. Since type inference was
available before native Py_UNICODE support, it's unlikely that users will
have Py_UNICODE written in their code explicitly. So I can make the switch
under the hood.
Just to explain, a native CPython C type is much better than an arbitrary
integer type, because it allows Cython to apply specific coercion rules
from and to Python object types. Like Py_UNICODE does currently, Py_UCS4
would obviously coerce from and to a 1 character Unicode string, but it could
additionally handle surrogate pair splitting and combining automatically on
current 16-bit Unicode builds, so that you'd get a Unicode string with two
code units on coercion to Python.
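
To make the splitting and combining concrete, here is a rough sketch of the
arithmetic in plain Python. It is only an illustration of the standard UTF-16
surrogate rules, not what Cython generates; the function names
split_surrogates and combine_surrogates are made up for this example.

```python
def split_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair.

    Hypothetical helper: this is what a Py_UCS4 -> str coercion would have
    to do on a 16-bit Unicode build.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def combine_surrogates(high, low):
    """Fold a (high, low) surrogate pair back into a single code point.

    Hypothetical helper: the reverse direction, str -> Py_UCS4.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) round-trips through the pair D834 DD1E.
high, low = split_surrogates(0x1D11E)
restored = combine_surrogates(high, low)
```

On a 32-bit Unicode build neither step is needed, since the value fits into
a single code unit as is.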
>> While I'm somewhat confident that I'll
>> find a way to fix this in Cython, my point is just that this adds a
>> certain level of complexity to C code using the new memory layout that
>> simply wasn't there before.
> Understood. However, I think it is easier than you think it is.
Let's see about the implications once there is an implementation.