[Python-Dev] New Py_UNICODE doc (Another Attempt)
Nicholas Bastin
nbastin at opnet.com
Fri May 6 22:20:39 CEST 2005
After reading through the code and the comments in this thread, I
propose the following in the documentation as the definition of
Py_UNICODE:
"This type represents the storage type which is used by Python
internally as the basis for holding Unicode ordinals. Extension module
developers should make no assumptions about the size or native encoding
of this type on any given platform."
The main point here is that extension developers cannot safely treat
Py_UNICODE as a fixed-width type with a known encoding (which appeared
to be safe when the documentation stated that it was always 16 bits).
I don't propose that we put this information in the doc, but the
possible internal representations are:
2-byte wchar_t or unsigned short encoded as UTF-16
4-byte wchar_t encoded as UTF-32 (UCS-4)
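To make the difference between these two layouts concrete, here is a
minimal sketch in Python. It uses the standard utf-16-le and utf-32-le
codecs as stand-ins for the two storage widths; it does not inspect
Python's real internal buffers, and the helper name code_units is made
up for illustration.

```python
def code_units(text, unit_size):
    """Count the storage units needed for text at a given unit width.

    unit_size 2 stands in for the UTF-16 layout, 4 for UTF-32 (UCS-4).
    Hypothetical helper, for illustration only.
    """
    codec = {2: "utf-16-le", 4: "utf-32-le"}[unit_size]
    # The explicit -le codecs emit no byte order mark, so the byte
    # length divided by the unit width is the number of storage units.
    return len(text.encode(codec)) // unit_size


bmp = "\u20ac"         # EURO SIGN, inside the Basic Multilingual Plane
astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(code_units(bmp, 2), code_units(bmp, 4))        # 1 1
print(code_units(astral, 2), code_units(astral, 4))  # 2 1
```

The character outside the BMP takes two 16-bit units (a surrogate
pair) but only one 32-bit unit, so code that assumes "one unit per
character" or "two bytes per unit" is only right on some builds.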
If you do not explicitly set the configure option, you cannot guarantee
which one you will get. Python also does not normalize the byte order
of Unicode strings passed into it from C (via PyUnicode_EncodeUTF16,
for example), so it is possible to have UTF-16LE and UTF-16BE strings
in the system at the same time, which is a bit confusing. This may or
may not be worth a mention in the doc (or a patch).
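As a rough illustration of why mixed byte orders are confusing, the
sketch below works on encoded byte strings using the standard codecs.
It is not a view into Python's internal storage, and the helper
normalize_to_le is a hypothetical name, not an existing API.

```python
import codecs


def normalize_to_le(data, source_order):
    """Re-encode BOM-less UTF-16 bytes as little-endian.

    source_order must be "le" or "be"; the caller has to know it,
    because the bytes themselves do not say. Hypothetical helper.
    """
    codec = {"le": "utf-16-le", "be": "utf-16-be"}[source_order]
    return data.decode(codec).encode("utf-16-le")


le = "A".encode("utf-16-le")  # b"A\x00"
be = "A".encode("utf-16-be")  # b"\x00A"

# Same text, different bytes.
print(le == be)  # False

# Reading little-endian bytes as big-endian silently yields U+4100.
print(le.decode("utf-16-be") == "\u4100")  # True

# A byte order mark removes the ambiguity for the generic codec.
print((codecs.BOM_UTF16_BE + be).decode("utf-16"))  # A

print(normalize_to_le(be, "be") == le)  # True
```

Without a byte order mark or some out-of-band knowledge, nothing in
the data distinguishes the two orders, which is the source of the
confusion described above.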
--
Nick