[Python-3000] Unicode and OS strings
Guido van Rossum
guido at python.org
Tue Sep 18 23:29:41 CEST 2007
On 9/18/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Guido has stated that the
> internal representation used by Python strings is a sequence of
> Unicode code units, not characters. I don't think that's reached the
> status of "pronouncement" yet, but you will probably need a PEP to get
> the guarantees you want.
I think of this as cast in stone; we can't reasonably guarantee more
if we want to be compatible with the UTF-16 (*) Unicode
representations used on Windows and in Java. How much more
pronouncement do you want?
(*) I'm not at all sure that it's called that -- you guys keep asking
trick questions based on terminology that's only clear to people who
have read the Unicode standard several times forwards and backwards. I
mean the representation that uses 16-bit values, where characters >=
2**16 are represented as two 16-bit "surrogate" values. (I hope I at
least have the 'surrogate' thing right this time.)
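As a rough illustration of the representation described in the footnote (not part of the original message), here is a minimal Python sketch of how one code point maps to 16-bit code units. The function name to_utf16_code_units is hypothetical, chosen only for this example; the arithmetic follows the standard UTF-16 surrogate scheme.

```python
def to_utf16_code_units(code_point):
    """Return the 16-bit code units that represent one Unicode code point.

    Hypothetical helper for illustration only; it is not a Python API.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x10000:
        # Fits in a single 16-bit unit. (Lone surrogate values in
        # 0xD800..0xDFFF are passed through unchecked in this sketch.)
        return [code_point]
    # Characters >= 2**16: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high ("lead") surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low ("trail") surrogate
    return [high, low]


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
units = to_utf16_code_units(0x1D11E)
print([hex(u) for u in units])
```

A string stored this way has two code units for that one character, which is why indexing and length are defined in terms of code units rather than characters.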
--
--Guido van Rossum (home page: http://www.python.org/~guido/)