
[Tim]
So what if MAL amended his suggestion to
    reject signed 2-byte wchar_t as not-usable?
[M.-A. Lemburg]
That would not solve the problem.
[Tim]
Then what is the problem, specifically? I thought you agreed with Martin that a signed 32-bit type doesn't hurt, since the sign bit then remains clear in all cases of Unicode data.
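To make that point concrete (an illustrative sketch, not code from the thread; it uses C99 <stdint.h> for an exact-width type):

    /* The largest Unicode code point, U+10FFFF, is well below
       INT32_MAX (0x7FFFFFFF), so storing any Unicode code point in a
       signed 32-bit type leaves the sign bit clear. */
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        int32_t ch = 0x10FFFF;   /* highest Unicode code point */
        assert(ch > 0);          /* sign bit is clear */
        assert(ch <= INT32_MAX); /* with plenty of room to spare */
        return 0;
    }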
[M.-A. Lemburg]
Note that we have proper conversion routines that allow converting between wchar_t and Py_UNICODE. These routines must be used for conversions anyway (even if Py_UNICODE and wchar_t happen to be the same type), so from a programmer's perspective, changing Py_UNICODE to be unsigned won't be noticed, and we don't lose much.
Again, I don't see the point in using a signed type for data that has no concept of signed values. It's just bad design, and we shouldn't go down the same route if we don't have to.
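The hazard behind that design point can be shown in a few lines (an illustrative sketch, not code from the thread; it assumes a platform where short is 16 bits):

    /* With a signed 16-bit character type, any value with the high
       bit set sign-extends when promoted to int, so a plain
       comparison against the code point's value fails. */
    #include <stdio.h>

    int main(void)
    {
        short ch = (short)0xFFFD;   /* U+FFFD REPLACEMENT CHARACTER */
        unsigned short uch = 0xFFFDu;

        if (ch == 0xFFFD)           /* ch promotes to -3, not 0xFFFD */
            printf("matched\n");
        else
            printf("sign extension broke it: ch == %d\n", ch);

        if (uch == 0xFFFD)          /* unsigned promotes to 65533 */
            printf("unsigned comparison works\n");
        return 0;
    }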
[Tim]
I don't know why Martin favors wchar_t when possible; the answer to that isn't clear. Neither is the answer to why there's an intractable problem if wchar_t happens to be a signed type wider than 2 bytes.
[M.-A. Lemburg]
The Unicode implementation has always defined Py_UNICODE to be an unsigned type; see the Unicode PEP 100:
""" Internal Format
The internal format for Unicode objects should use a Python specific fixed format <PythonUnicode> implemented as 'unsigned short' (or another unsigned numeric type having 16 bits). Byte order is platform dependent.
...
The configure script should provide aid in deciding whether Python can use the native wchar_t type or not (it has to be a 16-bit unsigned type). """
Python can also deal with UCS4 now, but the concept remains the same.
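A minimal sketch of the kind of compile-time decision the PEP describes (hypothetical, not CPython's actual configure machinery; WCHAR_MIN and WCHAR_MAX come from C99 <wchar.h>):

    /* Hypothetical sketch: accept the native wchar_t as the internal
       character type only if it is unsigned and at least 16 bits
       wide; otherwise fall back to an unsigned 16-bit integer. */
    #include <wchar.h>

    #if defined(WCHAR_MIN) && WCHAR_MIN == 0 && WCHAR_MAX >= 0xFFFF
    typedef wchar_t Py_UNICODE;         /* native wchar_t is usable */
    #else
    typedef unsigned short Py_UNICODE;  /* fallback: unsigned, >= 16 bits */
    #endif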
[Tim]
Well, it doesn't have to be a 16-bit type either, even in a UCS2 build. We had a long argument about that one before, because a particular Cray system didn't have any 16-bit type and the Unicode code wasn't working there. That got repaired when I rewrote the few bits of code that assumed "exactly 16 bits" to live with the weaker "at least 16 bits".

In this iteration, Martin agreed that a signed 16-bit wchar_t can be rejected. The question remaining is what actual problem exists when there's a signed wchar_t wider than 16 bits. Since Jeremy is running on exactly such a system, and the tests pass for him, there's no *obvious* problem with it (the segfault he experienced was due to reading uninitialized memory; that was a bug, and it's been fixed).
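The flavor of that repair can be sketched like this (illustrative only, not the actual change; rotate_char and the typedef here are hypothetical):

    /* With an exactly-16-bit Py_UNICODE, a cast alone truncated an
       arithmetic result back into range; with an "at least 16 bits"
       type (e.g. a 64-bit type on that Cray), an explicit mask is
       needed. */
    typedef unsigned long Py_UNICODE;   /* may be wider than 16 bits */

    Py_UNICODE rotate_char(Py_UNICODE ch, unsigned int offset)
    {
        /* (Py_UNICODE)(ch + offset) only wraps at 2**16 when the type
           is exactly 16 bits wide, so mask explicitly instead. */
        return (Py_UNICODE)((ch + offset) & 0xFFFFUL);
    }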