[Python-Dev] New Py_UNICODE doc (Another Attempt)

Nicholas Bastin nbastin at opnet.com
Fri May 6 22:20:39 CEST 2005


After reading through the code and the comments in this thread, I 
propose the following in the documentation as the definition of 
Py_UNICODE:

"This type represents the storage type which is used by Python 
internally as the basis for holding Unicode ordinals.  Extension module 
developers should make no assumptions about the size or native encoding 
of this type on any given platform."

The main point here is that extension developers cannot safely slam 
raw data into Py_UNICODE buffers (which appeared to be safe when the 
documentation stated that the type was always 16 bits wide).

I don't propose that we put this information in the doc, but the 
possible internal representations are:

2-byte wchar_t or unsigned short encoded as UTF-16
4-byte wchar_t encoded as UTF-32 (UCS-4)
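
To illustrate why the width matters to extension code, here is a minimal 
sketch (not part of the proposed doc text) that widens a buffer of code 
units to code points without hard-coding the unit size.  The function 
name to_code_points and its signature are hypothetical; it assumes the 
buffer is suitably aligned, holds native-order units, and that out has 
room for n entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n code units of unit_size bytes each
 * (2 = UTF-16, 4 = UTF-32) into code points stored in out.
 * Returns the number of code points written, or (size_t)-1 if the
 * unit size is not one of the two representations listed above. */
static size_t to_code_points(const void *units, size_t n,
                             size_t unit_size, uint32_t *out)
{
    size_t written = 0;
    assert(units != NULL && out != NULL);

    if (unit_size == 4) {
        /* UTF-32: every unit is already a full code point. */
        const uint32_t *s = units;
        for (size_t i = 0; i < n; i++)
            out[written++] = s[i];
    } else if (unit_size == 2) {
        /* UTF-16: a high surrogate followed by a low surrogate
         * encodes a single code point above U+FFFF. */
        const uint16_t *s = units;
        for (size_t i = 0; i < n; i++) {
            uint32_t cp = s[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10)
                             + (s[i + 1] - 0xDC00);
                i++;  /* consume the low surrogate as well */
            }
            out[written++] = cp;
        }
    } else {
        return (size_t)-1;  /* unknown width: refuse to guess */
    }
    return written;
}
```

In a real extension, one would presumably pass sizeof(Py_UNICODE) as 
unit_size rather than a literal 2 or 4, so the same code works on 
either build.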

If you do not explicitly set the configure option, you cannot guarantee 
which one you will get.  Python also does not normalize the byte order 
of Unicode strings passed into it from C (via PyUnicode_EncodeUTF16, 
for example), so it is possible to have UTF-16LE and UTF-16BE strings 
in the system at the same time, which is a bit confusing.  This may or 
may not be worth a mention in the doc (or a patch).
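
As a rough sketch of what normalizing byte order on the C side could 
look like, the helper below reads raw UTF-16 bytes into native-order 
16-bit units before they are handed to anything else.  The names 
utf16_bytes_to_native and enum utf16_order are hypothetical and not 
part of the Python C API; the sketch assumes a leading byte order mark, 
if present, should override the caller's default and be dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-order selector for raw UTF-16 input. */
enum utf16_order { UTF16_LE, UTF16_BE };

/* Hypothetical helper: decode n_bytes of raw UTF-16 into native-order
 * code units in out.  A leading BOM (FF FE or FE FF) overrides the
 * caller-supplied order and is skipped.  A trailing odd byte is
 * ignored.  Returns the number of code units written. */
static size_t utf16_bytes_to_native(const unsigned char *bytes,
                                    size_t n_bytes,
                                    enum utf16_order order,
                                    uint16_t *out)
{
    size_t i = 0, written = 0;
    assert(bytes != NULL && out != NULL);

    if (n_bytes >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            order = UTF16_LE;
            i = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            order = UTF16_BE;
            i = 2;
        }
    }

    /* Assemble each unit from two bytes so the result does not depend
     * on the host's own endianness. */
    for (; i + 1 < n_bytes; i += 2) {
        out[written++] = (order == UTF16_LE)
            ? (uint16_t)(bytes[i] | (bytes[i + 1] << 8))
            : (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
    }
    return written;
}
```

With something like this at the boundary, both UTF-16LE and UTF-16BE 
input end up in one consistent in-memory form.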

--
Nick


