Re: [Python-checkins] python/dist/src/Objects unicodeobject.c, 2.197, 2.198

tim_one@users.sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects In directory sc8-pr-cvs1:/tmp/cvs-serv17421/Objects
Modified Files: unicodeobject.c Log Message: On c.l.py, Martin v. Löwis said that Py_UNICODE could be of a signed type, so fiddle Jeremy's fix to live with that. Also added more comments.
Note that the implementation will bomb in several places if Py_UNICODE is a signed type. Py_UNICODE was never intended to be a signed type, so the proper fix would be to add logic so that Py_UNICODE gets forced to be an unsigned type. The only case where Py_UNICODE could become signed is via a compiler that defines wchar_t to be signed -- rather unlikely. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 16 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

[M.-A. Lemburg]
Note that the implementation will bomb in several places if Py_UNICODE is a signed type.
Py_UNICODE was never intended to be a signed type, so the proper fix would be to add logic so that Py_UNICODE gets forced to be an unsigned type.
Jeremy believed Py_UNICODE was already an unsigned type on his box, and that was the box with the segfaults. I don't know. Comparison with a signed int confused the issues to the point where we gave up and just fixed it <wink>.
The only case where Py_UNICODE could become signed is via a compiler that defines wchar_t to be signed -- rather unlikely.
The C standard requires wchar_t to be an integer type, but doesn't constrain it further to an unsigned integer type. I don't like relying on non-standard assumptions in cases where there's little-to-no cost in not relying on them. For example, the cast I put in with this patch is probably a nop on most boxes, just forcing an unsigned comparison (which must have been the original intent, if Py_UNICODE was assumed to be an unsigned type).

Tim Peters wrote:
[M.-A. Lemburg]
Note that the implementation will bomb in several places if Py_UNICODE is a signed type.
Py_UNICODE was never intended to be a signed type, so the proper fix would be to add logic so that Py_UNICODE gets forced to be an unsigned type.
Jeremy believed Py_UNICODE was already an unsigned type on his box, and that was the box with the segfaults. I don't know. Comparison with a signed int confused the issues to the point where we gave up and just fixed it <wink>.
That sounds more like a compiler bug to me. What's bothering me is that such compares are done in other places too, so a more general solution would be better.
The only case where Py_UNICODE could become signed is via a compiler that defines wchar_t to be signed -- rather unlikely.
The C standard requires wchar_t to be an integer type, but doesn't constrain it further to an unsigned integer type. I don't like relying on non-standard assumptions in cases where there's little-to-no cost in not relying on them. For example, the cast I put in with this patch is probably a nop on most boxes, just forcing an unsigned comparison (which must have been the original intent, if Py_UNICODE was assumed to be an unsigned type).
No question there, but wouldn't it be easier to test for such a platform and then fall back to "unsigned int" in case wchar_t is found to be a signed type? -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 17 2003)

On Wed, Sep 17, 2003 at 10:08:13AM +0200, M.-A. Lemburg wrote:
No question there, but wouldn't it be easier to test for such a platform and then fall back to "unsigned int" in case wchar_t is found to be a signed type?
This inaccurate comment created by configure [from 2.3b1] implies that sds/2 already expects wchar_t to be unsigned if it is to be usable:
    if test "$unicode_size" = "$ac_cv_sizeof_wchar_t"
    then
      PY_UNICODE_TYPE="wchar_t"
      AC_DEFINE(HAVE_USABLE_WCHAR_T, 1,
      [Define if you have a useable wchar_t type defined in wchar.h; useable
       means wchar_t must be 16-bit unsigned type. (see
       Include/unicodeobject.h).])
      AC_DEFINE(PY_UNICODE_TYPE,wchar_t)
... but configure doesn't actually check for signedness, and the "16-bit" part is inaccurate. (it must be 16 bits for ucs-2, or 32-bits for ucs-4)
Jeff
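A signedness probe of the kind configure would need could look roughly like the sketch below. It is not taken from Python's configure script; the helper names wchar_t_is_signed and wchar_t_bits are made up for illustration, and the cross-check against WCHAR_MIN assumes a C99 <wchar.h>.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if wchar_t is a signed type here.
   Converting -1 to an unsigned type yields that type's maximum value,
   so the comparison is true only for a signed wchar_t. */
static int wchar_t_is_signed(void)
{
    int is_signed = (wchar_t)-1 < (wchar_t)0;
    /* Cross-check against the limit macro the C library provides. */
    assert(is_signed == (WCHAR_MIN < 0));
    return is_signed;
}

/* Hypothetical helper: width of wchar_t in bits, which the quoted
   configure fragment only checks indirectly via sizeof. */
static int wchar_t_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}
```

A configure test would typically compile and run something like this once and record the two results as preprocessor defines.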

On Wed, Sep 17, 2003 at 09:59:11AM -0500, Jeff Epler wrote:
This inaccurate comment created by configure [from 2.3b1] implies that _____ already expects wchar_t to be unsigned if it is to be usable:
urm, I meant to refer to Python, not a software product most of you are lucky enough to have never heard of... Jeff

"M.-A. Lemburg" <mal@lemburg.com> writes:
No question there, but wouldn't it be easier to test for such a platform and then fall back to "unsigned int" in case wchar_t is found to be a signed type?
Why is it important that Py_UNICODE is unsigned? Regards, Martin

Martin v. Löwis wrote:
"M.-A. Lemburg" <mal@lemburg.com> writes:
No question there, but wouldn't it be easier to test for such a platform and then fall back to "unsigned int" in case wchar_t is found to be a signed type?
Why is it important that Py_UNICODE is unsigned?
Because that's what was used as the basis in the type implementation as well as the codecs (internal and external). Comparisons simply work differently when you're using a signed type, which is also why most compilers warn about this -- but you know that. A signed type also doesn't make much sense for things like character storage -- the sign information is useless and you lose a bit for each character. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 18 2003)

"M.-A. Lemburg" <mal@lemburg.com> writes:
Because that's what was used as the basis in the type implementation as well as the codecs (internal and external). Comparisons simply work differently when you're using a signed type, which is also why most compilers warn about this -- but you know that.
Yes and no. It appears that you are assuming that Py_UNICODE is always unsigned, however, it is (AFAICT) nowhere documented, and I don't see any strong reason ("it used to be that way" is not a strong reason - code does change over time).
A signed type also doesn't make much sense for things like character storage -- the sign information is useless and you lose a bit for each character.
OTOH, using wchar_t where possible is also valuable. On all systems with a signed wchar_t that we know of, that wchar_t has 32 bits - losing the sign bit does not mean losing character code points, as Unicode has fewer than 2**17 code points anyway. Regards, Martin

[Tim]
Jeremy believed Py_UNICODE was already an unsigned type on his box, and that was the box with the segfaults. I don't know. Comparison with a signed int confused the issues to the point where we gave up and just fixed it <wink>.
[M.-A. Lemburg]
That sounds more like a compiler bug to me.
It could be, but if so Jeremy is running on a mainstream Linux+gcc platform and then it's something we can't really wish away. Jeremy, can you tell us what Py_UNICODE resolves to on your box, or give enough details so someone else can figure it out? As I read the C standard, unsigned_int < 256 has to use unsigned comparison, so it's a compiler bug, or I'm misreading the standard, or Jeremy was mistaken in believing Py_UNICODE resolves to an unsigned thingie on his box (we know for sure that the bit pattern 0xcdcdcdcd compared less than 256 on his box; that's obviously what it should do if Py_UNICODE resolves to a signed 4-byte thing on his box, but not otherwise).
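The observation about 0xcdcdcdcd can be reproduced with the sketch below. It uses int32_t as a stand-in for a signed 4-byte Py_UNICODE, which is an assumption for illustration; the function names are invented and this is not the code from unicodeobject.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a signed 32-bit value without relying
   on implementation-defined conversion; int32_t is two's complement. */
static int32_t as_signed32(uint32_t bits)
{
    int32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* The test as it behaves when the element type is signed. */
static int below_256_signed(int32_t ch)
{
    return ch < 256;
}

/* The same test with a cast that forces an unsigned comparison. */
static int below_256_forced_unsigned(int32_t ch)
{
    return (uint32_t)ch < 256u;
}

static void demo_garbage_compare(void)
{
    int32_t garbage = as_signed32(0xcdcdcdcdu);
    assert(garbage < 0);                          /* sign bit is set */
    assert(below_256_signed(garbage));            /* looks like Latin-1 */
    assert(!below_256_forced_unsigned(garbage));  /* cast rejects it */
    assert(below_256_signed('A') && below_256_forced_unsigned('A'));
}
```

For values that are valid in both interpretations the two functions agree, which is why the cast costs nothing on platforms where the type is already unsigned.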
What's bothering me is that such compares are done in other places too, so a more general solution would be better.
I'd like to figure out what Jeremy's true problem was first -- we've got a solution to his symptom now, but don't really know why it was necessary.

On Wed, 2003-09-17 at 20:22, Tim Peters wrote:
It could be, but if so Jeremy is running on a mainstream Linux+gcc platform and then it's something we can't really wish away. Jeremy, can you tell us what Py_UNICODE resolves to on your box, or give enough details so someone else can figure it out?
As I read the C standard,
unsigned_int < 256
has to use unsigned comparison, so it's a compiler bug, or I'm misreading the standard, or Jeremy was mistaken in believing Py_UNICODE resolves to an unsigned thingie on his box (we know for sure that the bit pattern 0xcdcdcdcd compared less than 256 on his box; that's obviously what it should do if Py_UNICODE resolves to a signed 4-byte thing on his box, but not otherwise).
I was a little confused by the various UNICODE macros. (Is there a comment block somewhere that explains what they are for?) gcc -E tells me:
    typedef unsigned int Py_UCS4;
    typedef wchar_t Py_UNICODE;
    typedef long int wchar_t;
(not necessarily in that order)
I got Py_UCS4 and Py_UNICODE confused. The detailed output confirms that Py_UNICODE is a signed long int.
Jeremy

[Jeremy Hylton]
I was a little confused by the various UNICODE macros. (Is there a comment block somewhere that explains what they are for?)
Not that I've found. If someone writes one, don't forget the intended difference between PY_UNICODE_TYPE and Py_UNICODE (hint: there isn't a difference <wink>).
gcc -E tells me:
    typedef unsigned int Py_UCS4;
    typedef wchar_t Py_UNICODE;
    typedef long int wchar_t;
(not necessarily in that order)
I got Py_UCS4 and Py_UNICODE confused. The detailed output confirms that Py_UNICODE is a signed long int.
So that puts an end to the claim that it's unlikely wchar_t will resolve to a signed type. Strangely, while char is a signed type under MSVC, wchar_t is an unsigned type. I expect both differ under gcc, then. At least it's consistent <wink>. Anyway, everywhere the code may be doing a_Py_UNICODE comparison a_(signed)_int is doing something unintended now on your box. "The rules" for mixed-signedness comparison are pretty much a nightmare, especially when you're not sure how many bits are involved on both sides: http://yarchive.net/comp/ansic_broken_unsigned.html MAL's idea of forcing PY_UNICODE_TYPE to resolve to an unsigned type may be the easiest way out.

"Tim Peters" <tim.one@comcast.net> writes:
So that puts an end to the claim that it's unlikely wchar_t will resolve to a signed type. Strangely, while char is a signed type under MSVC, wchar_t is an unsigned type.
I think people have learned, over time, that negative characters are evil. So MS chose to make wchar_t unsigned, as, for 2-byte Unicode, you might otherwise run into negative characters. For gcc, people probably also agree that negative characters are evil, but using long int causes no harm here: All assigned characters are still positive, as Unicode goes only up to 2**21. Regards, Martin

[martin@v.loewis.de]
I think people have learned, over time, that negative characters are evil. So MS chose to make wchar_t unsigned, as, for 2-byte Unicode, you might otherwise run into negative characters. For gcc, people probably also agree that negative characters are evil, but using long int causes no harm here: All assigned characters are still positive, as Unicode goes only up to 2**21.
That's fine for gcc then, so long as we don't branch on the contents of uninitialized memory, and we just fixed a bug of that sort. If Py_UNICODE ever resolves to a signed 2-byte type, though, the sign bit is still in play for legit contents.
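The difference between the 2-byte and 4-byte cases can be made concrete with a small sketch. The types int16_t and int32_t stand in for hypothetical signed wchar_t variants, and store_signed16 is an invented helper.

```c
#include <assert.h>
#include <stdint.h>

/* Store a BMP code point in a signed 16-bit cell, as a signed 2-byte
   wchar_t would. The wraparound is written out explicitly so the result
   does not depend on implementation-defined narrowing. */
static int16_t store_signed16(uint16_t code_point)
{
    return (code_point >= 0x8000u)
        ? (int16_t)((int32_t)code_point - 0x10000)
        : (int16_t)code_point;
}

static void demo_sign_bit(void)
{
    /* U+FFFD REPLACEMENT CHARACTER is legitimate data, yet negative. */
    int16_t narrow = store_signed16(0xFFFDu);
    assert(narrow < 0);
    assert(narrow < 256);      /* wrongly passes a "< 256" test */

    /* In a signed 32-bit cell even the largest code point is positive. */
    int32_t wide = 0x10FFFF;
    assert(wide > 0);
    assert(!(wide < 256));
}
```

So with 32 bits only garbage can be negative, while with 16 bits half of the BMP is.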

"Tim Peters" <tim.one@comcast.net> writes:
That's fine for gcc then, so long as we don't branch on the contents of uninitialized memory, and we just fixed a bug of that sort.
If Py_UNICODE ever resolves to a signed 2-byte type, though, the sign bit is still in play for legit contents.
Right. I agree that we don't want to use a signed 2-byte type. Given that only MS uses a two-byte wchar_t (to my knowledge), and that theirs is unsigned, having an autoconf test to detect that this configuration is not supported is more than enough. Regards, Martin

Martin v. Löwis wrote:
"Tim Peters" <tim.one@comcast.net> writes:
That's fine for gcc then, so long as we don't branch on the contents of uninitialized memory, and we just fixed a bug of that sort.
If Py_UNICODE ever resolves to a signed 2-byte type, though, the sign bit is still in play for legit contents.
Right. I agree that we don't want to use a signed 2-byte type. Given that only MS uses a two-byte wchar_t (to my knowledge), and that theirs is unsigned, having an autoconf test to detect that this configuration is not supported is more than enough.
Since wchar_t is the only case where a signed type can pop up, why not extend the autoconf test to check for signedness and then reject signed wchar_t value as not-usable (ie. undefine HAVE_USABLE_WCHAR_T). It looks to me as if this would resolve the problem once and for all. Signed values simply cause too many problems for this kind of application. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 19 2003)

"M.-A. Lemburg" <mal@lemburg.com> writes:
Since wchar_t is the only case where a signed type can pop up, why not extend the autoconf test to check for signedness and then reject signed wchar_t value as not-usable (ie. undefine HAVE_USABLE_WCHAR_T).
Because that would exclude a number of relevant systems where wchar_t would be usable. Regards, Martin

[M.-A. Lemburg]
Since wchar_t is the only case where a signed type can pop up, why not extend the autoconf test to check for signedness and then reject signed wchar_t value as not-usable (ie. undefine HAVE_USABLE_WCHAR_T).
[martin@v.loewis.de]
Because that would exclude a number of relevant systems where wchar_t would be usable.
So what if MAL amended his suggestion to
reject signed 2-byte wchar_t value as not-usable
              +++++++
?

"Tim Peters" <tim.one@comcast.net> writes:
So what if MAL amended his suggestion to
reject signed 2-byte wchar_t value as not-usable
              +++++++
?
That would be a very sensible suggestion. Regards, Martin
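Written out as a decision function, the amended rule would read something like the sketch below. The function wchar_t_usable and its parameters are hypothetical, and this is not the logic that Python's configure script contains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy: wchar_t counts as usable when its size matches
   the Unicode storage size, unless it is a signed 2-byte type.
   Sizes are in bytes; is_signed is 0 or 1. */
static int wchar_t_usable(size_t wchar_size, size_t unicode_size,
                          int is_signed)
{
    assert(wchar_size > 0 && unicode_size > 0);
    if (wchar_size != unicode_size)
        return 0;
    if (is_signed && wchar_size == 2)
        return 0;   /* sign bit would collide with U+8000..U+FFFF */
    return 1;
}
```

Under this rule a signed 4-byte wch_t-sized type on a UCS4 build stays usable, while MAL's original suggestion would reject it.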

Tim Peters wrote:
[M.-A. Lemburg]
Since wchar_t is the only case where a signed type can pop up, why not extend the autoconf test to check for signedness and then reject signed wchar_t value as not-usable (ie. undefine HAVE_USABLE_WCHAR_T).
[martin@v.loewis.de]
Because that would exclude a number of relevant systems where wchar_t would be usable.
So what if MAL amended his suggestion to
reject signed 2-byte wchar_t value as not-usable
              +++++++
?
That would not solve the problem.
Note that we have proper conversion routines that allow converting between wchar_t and Py_UNICODE. These routines must be used for conversions anyway (even if Py_UNICODE and wchar_t happen to be the same type), so from a programmer perspective changing Py_UNICODE to be unsigned won't be noticed and we don't lose anything much.
Again, I don't see the point in using a signed type for data that doesn't have any concept of signed values. It's just bad design and we shouldn't try to go down the same route if we don't have to.
The Unicode implementation has always defined Py_UNICODE to be an unsigned type; see the Unicode PEP 100:
""" Internal Format
The internal format for Unicode objects should use a Python specific fixed format <PythonUnicode> implemented as 'unsigned short' (or another unsigned numeric type having 16 bits). Byte order is platform dependent.
...
The configure script should provide aid in deciding whether Python can use the native wchar_t type or not (it has to be a 16-bit unsigned type). """
Python can also deal with UCS4 now, but the concept remains the same.
-- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 21 2003)

[Tim]
So what if MAL amended his suggestion to
reject signed 2-byte wchar_t value as not-usable
              +++++++
?
[M.-A. Lemburg]
That would not solve the problem.
Then what is the problem, specifically? I thought you agreed with Martin that a signed 32-bit type doesn't hurt, since the sign bit remains clear then in all cases of Unicode data.
Note that we have proper conversion routines that allow converting between wchar_t and Py_UNICODE. These routines must be used for conversions anyway (even if Py_UNICODE and wchar_t happen to be the same type), so from a programmer perspective changing Py_UNICODE to be unsigned won't be noticed and we don't lose anything much.
Again, I don't see the point in using a signed type for data that doesn't have any concept of signed values. It's just bad design and we shouldn't try to go down the same route if we don't have to.
I don't know why Martin favors wchar_t when possible. The answer to that isn't clear. The answer to why there's an intractable problem if wchar_t happens to be a signed type > 2 bytes also isn't clear.
The Unicode implementation has always defined Py_UNICODE to be an unsigned type; see the Unicode PEP 100:
""" Internal Format
The internal format for Unicode objects should use a Python specific fixed format <PythonUnicode> implemented as 'unsigned short' (or another unsigned numeric type having 16 bits). Byte order is platform dependent.
...
The configure script should provide aid in deciding whether Python can use the native wchar_t type or not (it has to be a 16-bit unsigned type). """
Python can also deal with UCS4 now, but the concept remains the same.
Well, it doesn't have to be a 16-bit type either, even in a UCS2 build, and we had a long argument about that one before, because a particular Cray system didn't have any 16-bit type and the Unicode code wasn't working there. That got repaired when I rewrote the few bits of code that assumed "exactly 16 bits" to live with the weaker "at least 16 bits". In this iteration, Martin agreed that a signed 16-bit wchar_t can be rejected. The question remaining is what actual problem exists when there's a signed wchar_t exceeding 16 bits. Since Jeremy is running on exactly such a system, and the tests pass for him, there's no *obvious* problem with it (the segfault he experienced was due to reading uninitialized memory, and that was a bug, and that's been fixed).
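The thread does not show which code was changed for the Cray case, so the following is only an illustration of the general "exactly 16 bits" versus "at least 16 bits" distinction. The typedef ucs2_cell and both functions are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for UCS-2 storage on a platform with no exact 16-bit integer:
   at least 16 bits wide, here deliberately wider. */
typedef uint_least32_t ucs2_cell;

/* Relies on the cell type wrapping at exactly 16 bits; that is wrong as
   soon as the cell is wider. */
static ucs2_cell add_assuming_exact16(ucs2_cell ch, unsigned delta)
{
    return (ucs2_cell)(ch + delta);
}

/* Masks explicitly, so it only needs "at least 16 bits". */
static ucs2_cell add_masked(ucs2_cell ch, unsigned delta)
{
    return (ucs2_cell)((ch + delta) & 0xFFFFu);
}

static void demo_width_assumption(void)
{
    assert(add_assuming_exact16(0xFFFFu, 1) == 0x10000u); /* no wrap */
    assert(add_masked(0xFFFFu, 1) == 0u);                 /* wraps */
}
```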

Tim Peters wrote:
I don't know why Martin favors wchar_t when possible. The answer to that isn't clear.
If the wchar_t is "usable", some routines (notably PyUnicode_AsWideChar) are slightly more efficient, so I'd like to make wchar_t "usable" as much as possible. Regards, Martin
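The efficiency difference presumably comes down to a block copy versus an element-by-element conversion. The sketch below shows that contrast under this assumption; both function names are invented and uint16_t stands in for a storage type that differs from wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Fast path: the storage type is wchar_t itself, so the whole buffer
   can be copied as one block. */
static void copy_same_type(wchar_t *dst, const wchar_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n * sizeof(wchar_t));
}

/* Slow path: the storage type differs from wchar_t (here a hypothetical
   16-bit unsigned cell), so each element is converted individually. */
static void copy_converting(wchar_t *dst, const uint16_t *src, size_t n)
{
    size_t i;
    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++)
        dst[i] = (wchar_t)src[i];
}
```

Both paths produce the same result; only the first is available when the two types coincide.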

[Tim]
I don't know why Martin favors wchar_t when possible. The answer to that isn't clear.
[Martin v. Löwis]
If the wchar_t is "usable", some routines (notably PyUnicode_AsWideChar) are slightly more efficient, so I'd like to make wchar_t "usable" as much as possible.
OK. So is there an end to this thread <0.9 wink>? At the moment, it appears there's no identified reason to care about signedness of a greater-than 16-bit type, good reason to insist that a 16-bit type is unsigned, and that it's desirable for HAVE_USABLE_WCHAR_T to get defined when possible. What more does it take to bury this? If it's Unixish config changes, they won't be coming from me (the Windows build uses an unsigned 16-bit wchar_t).

Tim Peters wrote:
[Tim]
I don't know why Martin favors wchar_t when possible. The answer to that isn't clear.
[Martin v. Löwis]
If the wchar_t is "usable", some routines (notably PyUnicode_AsWideChar) are slightly more efficient, so I'd like to make wchar_t "usable" as much as possible.
OK. So is there an end to this thread <0.9 wink>? At the moment, it appears there's no identified reason to care about signedness of a greater-than 16-bit type,
Sure there is: first of all, having a single type that can be signed on some platforms and unsigned on others is a bad thing per se and second the 32-bit signed wchar_t value was what triggered this thread in the first place.
good reason to insist that a 16-bit type is unsigned, and that it's desirable for HAVE_USABLE_WCHAR_T to get defined when possible. What more does it take to bury this? If it's Unixish config changes, they won't be coming from me (the Windows build uses an unsigned 16-bit wchar_t).
That's what it takes, right. I'll work on it. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 22 2003)

M.-A. Lemburg wrote:
good reason to insist that a 16-bit type is unsigned, and that it's desirable for HAVE_USABLE_WCHAR_T to get defined when possible. What more does it take to bury this? If it's Unixish config changes, they won't be coming from me (the Windows build uses an unsigned 16-bit wchar_t).
That's what it takes, right. I'll work on it.
While working on the config changes, I noticed that Python now defaults to UCS4 when it finds a Tcl/Tk version that supports UCS4... I can't say that I particularly like this, since the config script now makes an implicit choice based on the third-party software configuration with consequences that are not made obvious for the user. E.g. on platforms that happen to have Tcl/Tk installed with UCS4 configuration, Python will compile using UCS4 (regardless of whether the user wants to use Tcl/Tk or not), on systems that don't have such a Tcl/Tk installation, Python compiles using UCS2. I'd suggest to make the UCS4 choice explicit again. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 22 2003)

On Mon, Sep 22, 2003 at 01:18:32PM +0200, M.-A. Lemburg wrote:
While working on the config changes, I noticed that Python now defaults to UCS4 when it finds a Tcl/Tk version that supports UCS4...
I can't say that I particularly like this, since the config script now makes an implicit choice based on the third-party software configuration with consequences that are not made obvious for the user.
E.g. on platforms that happen to have Tcl/Tk installed with UCS4 configuration, Python will compile using UCS4 (regardless of whether the user wants to use Tcl/Tk or not), on systems that don't have such a Tcl/Tk installation, Python compiles using UCS2.
I'd suggest to make the UCS4 choice explicit again.
See the (complete lack of) discussion on the bug at http://python.org/sf/798202 I can only say that this is a convenient patch for me, and probably for any redhat9 user who has not built his own tcl/tk. On the other hand, this patch is unlikely to actually affect anybody on a different platform. I don't think it's such a bad idea to have Python detect the configuration it must use to make a very important, bundled extension work properly. Would you feel any different if a different patch was added, which made --enable-unicode=tcl enable the behavior of this patch? Jeff

Jeff Epler wrote:
On Mon, Sep 22, 2003 at 01:18:32PM +0200, M.-A. Lemburg wrote:
While working on the config changes, I noticed that Python now defaults to UCS4 when it finds a Tcl/Tk version that supports UCS4...
I can't say that I particularly like this, since the config script now makes an implicit choice based on the third-party software configuration with consequences that are not made obvious for the user.
E.g. on platforms that happen to have Tcl/Tk installed with UCS4 configuration, Python will compile using UCS4 (regardless of whether the user wants to use Tcl/Tk or not), on systems that don't have such a Tcl/Tk installation, Python compiles using UCS2.
I'd suggest to make the UCS4 choice explicit again.
See the (complete lack of) discussion on the bug at http://python.org/sf/798202
I can only say that this is a convenient patch for me, and probably for any redhat9 user who has not built his own tcl/tk. On the other hand, this patch is unlikely to actually affect anybody on a different platform.
I don't think it's such a bad idea to have Python detect the configuration it must use to make a very important, bundled extension work properly. Would you feel any different if a different patch was added, which made --enable-unicode=tcl enable the behavior of this patch?
Yes :-) ... "explicit is better than implicit" -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 22 2003)

On Mon, 2003-09-22 at 08:04, Jeff Epler wrote:
I don't think it's such a bad idea to have Python detect the configuration it must use to make a very important, bundled extension work properly. Would you feel any different if a different patch was added, which made --enable-unicode=tcl enable the behavior of this patch?
Seems just as easy to document that on RH9 with the default tcl/tk, you need to include --enable-unicode=ucs4. -Barry

[Tim]
At the moment, it appears there's no identified reason to care about signedness of a greater-than 16-bit type,
[M.-A. Lemburg]
Sure there is: first of all, having a single type that can be signed on some platforms and unsigned on others is a bad thing per se
We inherit that from C, though -- it's fine by C if wchar_t is signed or unsigned, just as it refused to define the signedness of char.
and second the 32-bit signed wchar_t value was what triggered this thread in the first place.
What triggered the thread originally was a segfault due to the code making a branch based on the content of uninitialized memory. The code clearly didn't *think* it was reading up random heap bits, so that was a bug regardless of wchar_t's signedness. That wchar_t happened to be a signed 32-bit type on Jeremy's box is what uncovered the read-uninitialized-memory bug. If there's no other code vulnerable to bad behavior if wchar_t is a signed 32-bit type (nobody has identified another case), objections to it being signed anyway seem technically groundless. Martin did give a technical reason (efficiency) for wanting to continue to use wchar_t on Jeremy's system.

Tim Peters wrote:
[Tim]
At the moment, it appears there's no identified reason to care about signedness of a greater-than 16-bit type,
[M.-A. Lemburg]
Sure there is: first of all, having a single type that can be signed on some platforms and unsigned on others is a bad thing per se
We inherit that from C, though -- it's fine by C if wchar_t is signed or unsigned, just as it refused to define the signedness of char.
It may be fine for C... it is not for the Unicode implementation, since that has always assumed Py_UNICODE to be unsigned. This is fixed now.
and second the 32-bit signed wchar_t value was what triggered this thread in the first place.
What triggered the thread originally was a segfault due to the code making a branch based on the content of uninitialized memory. The code clearly didn't *think* it was reading up random heap bits, so that was a bug regardless of wchar_t's signedness.
True, but the test (unicode->str[0] < 256) is what revealed a second bug and that's what we've been discussing all along.
That wchar_t happened to be a signed 32-bit type on Jeremy's box is what uncovered the read-uninitialized-memory bug.
If there's no other code vulnerable to bad behavior if wchar_t is a signed 32-bit type (nobody has identified another case), objections to it being signed anyway seem technically groundless.
There are more comparisons of the above type in the code and even worse: it is documented that Py_UNICODE is unsigned, so it's very likely that code external to the Python distribution such as codec packages or applications talking to libraries use that assumption as well.
Martin did give a technical reason (efficiency) for wanting to continue to use wchar_t on Jeremy's system.
Python won't be using wchar_t on those systems anymore, so the problem is solved and the original intent restored. If efficiency matters programmers are always free to cast Py_UNICODE to wchar_t on these systems for fast read-only access. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Sep 22 2003)
participants (7)
- "Martin v. Löwis"
- Barry Warsaw
- Jeff Epler
- Jeremy Hylton
- M.-A. Lemburg
- martin@v.loewis.de
- Tim Peters