[Python-3000] Unicode and OS strings

Guido van Rossum guido at python.org
Tue Sep 18 23:29:41 CEST 2007


On 9/18/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Guido has stated that the
> internal representation used by Python strings is a sequence of
> Unicode code units, not characters.  I don't think that's reached the
> status of "pronouncement" yet, but you will probably need a PEP to get
> the guarantees you want.

I think of this as cast in stone; we can't reasonably guarantee more
if we want to be compatible with the UTF-16 (*) Unicode
representations used on Windows and in Java. How much more
pronouncement do you want?

(*) I'm not at all sure that it's called that -- you guys keep asking
trick questions based on terminology that's only clear to people who
have read the Unicode standard several times forwards and backwards. I
mean the representation that uses 16-bit values, where characters
with code points >= 2**16 are represented as two 16-bit "surrogate"
values. (I hope I at least have the "surrogate" thing right this
time.)
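
To make the footnote concrete, here is a minimal sketch in Python 3 of
how a character >= 2**16 splits into two 16-bit surrogate values. The
helper names to_surrogates and utf16_code_units are hypothetical, made
up for this illustration; the arithmetic is the usual UTF-16 one, and
the second helper only cross-checks it against the standard
"utf-16-be" codec.

```python
def to_surrogates(code_point):
    """Split a code point >= 2**16 into a (high, low) surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range that needs surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of text, using the built-in codec."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+1D11E (MUSICAL SYMBOL G CLEF) lies above 2**16, so it takes two units.
pair = to_surrogates(0x1D11E)
units = utf16_code_units("\U0001D11E")
print([hex(u) for u in pair], [hex(u) for u in units])
```

A sequence of code units, as opposed to characters, means that such a
string has two items in this representation even though it holds one
character.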

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

