[Python-3000] PEP Draft: Enhancing the buffer protocol
Josiah Carlson
jcarlson at uci.edu
Wed Feb 28 19:55:21 CET 2007
Travis Oliphant <oliphant.travis at ieee.org> wrote:
> I think you are right. In the discussions for unifying string/unicode I
> really like the proposals that are leaning toward having a unicode
> object be an immutable string of either ucs-1, ucs-2, or ucs-4 depending
> on what is in the string.
Except that it's not going to happen. The width of the unicode
representation is going to be fixed at compile time, generally utf-16 or
ucs-4. I say utf-16 because the representation allows for surrogate
pairs, etc., but each value of the pair is considered a "character",
whereas (according to my potentially flawed memory of reading the spec)
ucs-2 doesn't allow for surrogates.
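To make the surrogate-pair point concrete, here is a small sketch in
Python. It only illustrates how one character outside the Basic
Multilingual Plane occupies two 16-bit values in utf-16 and a single
32-bit value in ucs-4; it says nothing about how the interpreter stores
strings internally. The helper names (utf16_code_units, is_surrogate)
and the sample string are made up for this illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as little-endian UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_surrogate(unit):
    """True if a 16-bit value is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


# 'a', U+1D11E (MUSICAL SYMBOL G CLEF, outside the BMP), 'b'
sample = "a\U0001D11Eb"

units = utf16_code_units(sample)

# Number of fixed-width 32-bit values needed for the same text.
ucs4_count = len(sample.encode("utf-32-le")) // 4

# Positions in the utf-16 sequence that hold half of a pair.
halves = [i for i, unit in enumerate(units) if is_surrogate(unit)]
```

Here units has four entries while ucs4_count is three, and indexing the
utf-16 sequence at position 1 or 2 yields only half of the G clef.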
Note that I previously offered an overlay structure that could support
O(log n) time access to arbitrary full characters regardless of
encoding (utf-8, utf-16 or ucs-4) using O(log n) space, but it was
decided by Guido that Python should return a partial character (half of a
surrogate pair) rather than offer non-constant character access time.*
- Josiah
* As a side note, the space and time are really a function of how often
surrogates, or their equivalent in utf-8, etc., occur. Both are
O(log n) in the worst case, but the actual cost depends on the structure
of the occurrences of the non-constant character lengths.
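As a rough illustration of the overlay idea, the sketch below indexes a
utf-16 sequence by full character. It is not the structure referred to
above and does not match its space bound: it is a simpler stand-in that
keeps a sorted list of the character positions stored as surrogate
pairs, so lookup costs a binary search over that list and the extra
space grows with the number of pairs. The class name Utf16Index and its
attributes are hypothetical, and the input is assumed to be well-formed
utf-16.

```python
from bisect import bisect_left


class Utf16Index:
    """Full-character indexing over utf-16 code units (illustrative only)."""

    def __init__(self, text):
        raw = text.encode("utf-16-le")
        # The underlying 16-bit code units.
        self.units = [
            int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)
        ]
        # Sorted character indexes (not code unit indexes) of the
        # characters that are stored as surrogate pairs.
        self.wide = []
        char_index = 0
        i = 0
        while i < len(self.units):
            if 0xD800 <= self.units[i] <= 0xDBFF:  # high surrogate
                self.wide.append(char_index)
                i += 2
            else:
                i += 1
            char_index += 1
        self.length = char_index

    def __len__(self):
        return self.length

    def __getitem__(self, k):
        if not 0 <= k < self.length:
            raise IndexError("character index out of range")
        # Each pair before character k shifts its code unit position by one.
        pos = k + bisect_left(self.wide, k)
        unit = self.units[pos]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)


text = "a\U0001D11Eb\U0001F600c"
index = Utf16Index(text)
chars = [index[i] for i in range(len(index))]
```

With no surrogate pairs present the list is empty and access is
effectively constant time, which matches the point in the footnote that
the real cost depends on how often the wide characters occur.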