[Python-Dev] unicode/string asymmetries

Martin v. Loewis martin@v.loewis.de
Wed, 9 Jan 2002 22:11:50 +0100


> Huh, what did I miss? Why is PyArg_Parse deprecated, and by what
> should it be replaced?

Not precisely; METH_OLDARGS and its combination with PyArg_Parse are
deprecated; use PyArg_ParseTuple instead. That still leaves a few uses
of PyArg_Parse, but these are really too special to worry about.

> > and I doubt you have Py_UNICODE* often enough to need
> > it to pass to Py_BuildValue.

> Martin, have you ever wrapped any Unicode API's? (As opposed to
> using unicode as a purely internal datatype, which you clearly know
> a lot about).

Certainly, I've tried providing libiconv interfacing. I was strongly
pushing the notion that Py_UNICODE is equal to wchar_t on all
platforms; that notion was unfortunately rejected.

As a result, using wchar_t together with Python Unicode objects is
difficult. No existing C library reliably accepts Py_UNICODE*; if
anything, they accept wchar_t* (although Microsoft, and apparently
also Apple, manage to use yet another type, further complicating
issues).

There are exceptions: on some platforms, such as Windows, Py_UNICODE
is currently equal to wchar_t. That may change in the future, if
people request full Unicode support (i.e. a 4-byte Unicode type) -
then Py_UNICODE might differ from WCHAR even on Windows. At that time,
any code that currently assumes they are equal will break. So I'd
rather educate people about the issues now than have to come up with
work-arounds when they eventually run into them.
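
To make the width issue concrete, here is a minimal sketch that asks
the platform how wide its C wchar_t is. It assumes the ctypes module,
which is not part of this discussion, and the helper names
wchar_width and fits_in_wchar are made up for illustration.

```python
import ctypes


def wchar_width():
    """Return the size in bytes of the platform's C wchar_t."""
    return ctypes.sizeof(ctypes.c_wchar)


def fits_in_wchar(code_point):
    """Return True if code_point fits in a single wchar_t unit."""
    # Hypothetical helper: a 2-byte wchar_t holds values below 0x10000,
    # a 4-byte wchar_t holds every Unicode code point.
    return code_point < (1 << (8 * wchar_width()))
```

Code that hard-wires either answer is correct only on the platforms
where that answer happens to hold.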


> Thomas' question are similar to mine from last week, and Neil's are
> related too. All the niceties we have for strings (optional ones
> with z, autoconversion from unicode, s# to get the size) are missing
> for unicode, and that's a pain when you're wrapping an existing C
> api.

These problems are inherent in the subject matter: C's support for
Unicode, and its relationship to the char type, is inconsistent.

If Python offered a struct code that translates into wchar_t, he'd
get away with that on Windows. However, it seemed to me that the
specific structure was primarily used in files, so code that tries to
fill it should use formats that are platform-independent. For the
integer types, that means you cannot just use the "i" format, but you
need to know what the integer range is (i.e. 8, 16, 32, or 64
bits). Likewise, for strings, you need to know the width of each
character and the endianness.
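
As a rough sketch of what a platform-independent layout could look
like with the struct module: the record below (one 32-bit count
followed by a fixed-width name field) is invented for illustration and
is not the structure under discussion; pack_record and unpack_record
are hypothetical names, and UTF-16-LE is an assumed choice of
encoding.

```python
import struct


def pack_record(count, name, width=8):
    """Pack a 32-bit count and a fixed-width UTF-16-LE name field."""
    encoded = name.encode("utf-16-le")  # 2-byte units, little-endian
    if len(encoded) > width * 2:
        raise ValueError("name does not fit in the field")
    # "<" selects little-endian byte order with standard sizes and no
    # padding, so "I" is exactly 4 bytes on every platform.
    # The "s" code pads the field with NUL bytes up to its length.
    return struct.pack("<I%ds" % (width * 2), count, encoded)


def unpack_record(data, width=8):
    """Reverse pack_record, dropping the NUL padding from the name."""
    count, raw = struct.unpack("<I%ds" % (width * 2), data)
    return count, raw.decode("utf-16-le").rstrip("\0")
```

Because both the integer size and the character width and byte order
are spelled out, the same bytes come out regardless of what the native
int or wchar_t looks like.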

Furthermore, apart from Windows, I doubt *anybody* puts wide strings
in platform encoding into files. I'd hope everybody else is smart
enough to clearly define the encoding used when representing Unicode
strings in byte-oriented files, streams, and structures.
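
To illustrate why the encoding has to be named explicitly, this small
sketch shows the same string producing different bytes under
different encodings. The function name encoded_forms and the choice
of encodings are assumptions made for the example.

```python
def encoded_forms(text):
    """Return the bytes of text under several explicitly named encodings."""
    names = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")
    return {name: text.encode(name) for name in names}
```

A reader that is not told which of these forms a file uses has no
reliable way to recover the original string.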

Regards,
Martin