Re: [Python-checkins] python/dist/src/Objects unicodeobject.c, 2.204, 2.205

perky@users.sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects
In directory sc8-pr-cvs1:/tmp/cvs-serv1651/Objects

Modified Files:
	unicodeobject.c
Log Message:
SF #859573: Reduce compiler warnings on gcc 3.2 and above.

Index: unicodeobject.c
*** 2204,2208 ****
  /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */
! if (size == 1 && *(unsigned char*)s < 256) {
      Py_UNICODE r = *(unsigned char*)s;
      return PyUnicode_FromUnicode(&r, 1);
--- 2212,2216 ----
  /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */
! if (size == 1) {
      Py_UNICODE r = *(unsigned char*)s;
      return PyUnicode_FromUnicode(&r, 1);
This "fix" doesn't look right. Please check.
*** 2406,2409 ****
--- 2414,2421 ----
      else if (*p<1000)
          repsize += 2+3+1;
+ #ifndef Py_UNICODE_WIDE
      else
          repsize += 2+4+1;
+ #else
+     else if (*p<10000)
+         repsize += 2+4+1;
*** 2414,2417 ****
--- 2426,2430 ----
      else
          repsize += 2+7+1;
+ #endif
  }
  requiredsize = respos+repsize+(endp-collend);

On Fri, Dec 19, 2003 at 09:30:27AM +0100, M.-A. Lemburg wrote:
perky@users.sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects
In directory sc8-pr-cvs1:/tmp/cvs-serv1651/Objects

Modified Files:
	unicodeobject.c
Log Message:
SF #859573: Reduce compiler warnings on gcc 3.2 and above.

Index: unicodeobject.c
*** 2204,2208 ****
  /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */
! if (size == 1 && *(unsigned char*)s < 256) {
      Py_UNICODE r = *(unsigned char*)s;
      return PyUnicode_FromUnicode(&r, 1);
--- 2212,2216 ----
  /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */
! if (size == 1) {
      Py_UNICODE r = *(unsigned char*)s;
      return PyUnicode_FromUnicode(&r, 1);
This "fix" doesn't look right. Please check.
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o Objects/unicodeobject.o Objects/unicodeobject.c
Objects/unicodeobject.c: In function `PyUnicodeUCS2_DecodeLatin1':
Objects/unicodeobject.c:2214: warning: comparison is always true due to limited range of data type
AFAIK, *(unsigned char*)s is always smaller than 256. Also decoding latin-1 can be done by just casting it into Py_UNICODE.
I'm sorry but can you explain more?
Hye-Shik

On Fri, Dec 19, 2003 at 11:03:41AM +0100, Fredrik Lundh wrote:
Hye-Shik Chang wrote:
AFAIK, *(unsigned char*)s is always smaller than 256.
except when it isn't. see the ANSI C spec for details.
Ah, I found it. I'm very surprised by that. Thank you! :-) BTW, do we really support architectures with 9-bit chars?
Hye-Shik

Hye-Shik Chang <perky@i18n.org> writes:
On Fri, Dec 19, 2003 at 11:03:41AM +0100, Fredrik Lundh wrote:
Hye-Shik Chang wrote:
AFAIK, *(unsigned char*)s is always smaller than 256.
except when it isn't. see the ANSI C spec for details.
Ah, I found it. I'm very surprised by that. Thank you! :-) BTW, do we really support architectures with 9-bit chars?
On some kinds of Cray that Python has been built on in the past, I think the smallest addressable unit of memory is 64 bits. So, not quite 9 bits, but getting on that way. I don't think we want to make the lives of people porting to such architectures any harder than it already is...
Cheers, mwh

Michael Hudson wrote:
Hye-Shik Chang <perky@i18n.org> writes:
[...]
BTW, do we really support architectures with 9-bit chars?
[...]
I don't think we want to make the lives of people porting to such architectures any harder than it already is...
TI make chips where the smallest addressable unit is 16-bits and sizeof(char) == sizeof(int) == 16 bits == 1 byte due to the way the C standard is written.
I don't think Python is ported to any such chip at present (the one I use is a DSP, and I would seriously question the sanity of anyone who tried to run Python on one of these critters), but it's a possibility that shouldn't be ignored. Porting to such a machine would be rather entertaining (sizeof() gets a _lot_ of work in the code for that DSP).
Cheers, Nick.

Nick> Michael Hudson wrote:
>> Hye-Shik Chang <perky@i18n.org> writes:
Nick> [...]
>>> BTW, do we really support architectures with 9-bit chars?
Nick> [...]
>> I don't think we want to make the lives of people porting to such
>> architectures any harder than it already is...
Nick> TI make chips where the smallest addressable unit is 16-bits and
Nick> sizeof(char) == sizeof(int) == 16 bits == 1 byte due to the way
Nick> the C standard is written.
It seems to me the right thing to do is to cook up a test in the configure script which checks the number of bits in an unsigned char and sets a cpp macro which the code in question then uses to compile the fast case for 8-bit chars and the slow case otherwise.
Skip

Skip> It seems to me the right thing to do is to cook up a test in the
Skip> configure script which checks the number of bits in an unsigned
Skip> char ...
Better yet, let's use CHAR_BIT:
#if defined(CHAR_BIT) && CHAR_BIT == 8
    ... fast case ...
#else
    ... slow case ...
#endif
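To make the fast/slow split concrete, here is a minimal sketch of how a decode-time check might use it (illustrative only; the function name and exact form are made up for this example, not proposed patch text):

#include <limits.h>

/* Sketch: does this char value fit into Latin-1 (0..255)? */
static int
fits_latin1(unsigned char c)
{
#if defined(CHAR_BIT) && CHAR_BIT == 8
    (void)c;
    return 1;       /* fast case: an 8-bit unsigned char is always < 256 */
#else
    return c < 256; /* slow case: a wider char may hold larger values */
#endif
}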

AFAIK, *(unsigned char*)s is always smaller than 256.
except when it isn't. see the ANSI C spec for details.
Ah, I found it. I'm very surprised by that. Thank you! :-) BTW, do we really support architectures with 9-bit chars?
I would expect that a lot of our code assumes 8-bit characters, and I personally wouldn't mind if Python was limited to such platforms. They aren't very important for attracting new users, and certainly they don't seem to be a growing kind of platform... (Probably because so much other software makes the same assumption. :-)
So IMO your fix is fine.
--Guido van Rossum (home page: http://www.python.org/~guido/)

On 19-dec-03, at 16:39, Guido van Rossum wrote:
AFAIK, *(unsigned char*)s is always smaller than 256.
except when it isn't. see the ANSI C spec for details.
Ah, I found it. I'm very surprised by that. Thank you! :-) BTW, do we really support architectures with 9-bit chars?
I would expect that a lot of our code assumes 8-bit characters, and I personally wouldn't mind if Python was limited to such platforms.
Then there is a lot of code that could be tuned for this. Since I'm using gcc 3.3 (which came with OSX 10.3) I get lots of warnings about comparisons that are always true due to the size of the operands. I looked at a couple of these and I think they were all char-related.
--
Jack Jansen, Jack.Jansen@cwi.nl, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma Goldman

[Guido]
I would expect that a lot of our code assumes 8-bit characters, and I personally wouldn't mind if Python was limited to such platforms. They aren't very important for attracting new users, and certainly they don't seem to be a growing kind of platform... (Probably because so much other software makes the same assumption. :-)
Fine by me too.
The first mainframe I used was a Univac 1108. There were a *lot* of competing HW architectures at that time, and manufacturers didn't agree about character size any more than they agreed about floating-point format or semantics, or the natural size of "a word". Univac was forward-looking, though: they didn't want their hardware to become obsolete if a different character size than the one they preferred clicked, so a control bit in the CPU could be set to treat their 36-bit words as either 6 6-bit characters, or as 4 9-bit characters. It worked! We're *still* equally comfortable with 6-bit bytes as with 9-bit bytes <wink>.
I was betting on 6-bit bytes at the time, because that also worked well with CDC's 60-bit words. FORTRAN didn't even admit to the existence of lower case at the time, so 64 characters was way more than enough for anything anyone really needed to say to a computer.
half-the-bits-in-these-new-fangled-bytes-are-just-wasted-ly y'rs - tim

[Guido]
I would expect that a lot of our code assumes 8-bit characters, and I personally wouldn't mind if Python was limited to such platforms. They aren't very important for attracting new users, and certainly they don't seem to be a growing kind of platform... (Probably because so much other software makes the same assumption. :-)
Fine by me too.
The first mainframe I used was a Univac 1108. There were a *lot* of competing HW architectures at that time, and manufacturers didn't agree about character size any more than they agreed about floating-point format or semantics, or the natural size of "a word". Univac was forward-looking, though: they didn't want their hardware to become obsolete if a different character size than the one they preferred clicked, so a control bit in the CPU could be set to treat their 36-bit words as either 6 6-bit characters, or as 4 9-bit characters. It worked! We're *still* equally comfortable with 6-bit bytes as with 9-bit bytes <wink>.
I was betting on 6-bit bytes at the time, because that also worked well with CDC's 60-bit words. FORTRAN didn't even admit to the existence of lower case at the time, so 64 characters was way more than enough for anything anyone really needed to say to a computer.
half-the-bits-in-these-new-fangled-bytes-are-just-wasted-ly y'rs - tim
I would think the lesson to be learned from this is that one should not lock the software into any particular number of bits per character. The coming flood of 64 bit machines could make 16 bit unicode attractive. It's an ever more global world and "we" should keep in mind that in the next decade most of the world's programming is going to be done in India and China if American corporations have their way.
soon-to-be-looking-for-the-carton-the-mainframe-came-in-to-live-ingly-y'rs
Dave LeBlanc Seattle, WA USA

"David LeBlanc" whisper@oz.net writes:
I would think the lesson to be learned from this is that one should not lock the software into any particular number of bits per character. The coming flood of 64 bit machines could make 16 bit unicode attractive.
You are talking about an entirely different issue here. This thread is about the number of bits in a "char", which, in C, is the same thing as a "byte". The number of bits for a "character" is independent.
16-bit Unicode is attractive already, although it is dying to make way for 32-bit Unicode. However, new 64-bit architectures will make sure they support an 8-bit data type, and compiler vendors will make sure that "char" maps to that 8-bit data type (most likely, they will also make char signed by default). There is just too much software that would break if bytes could no longer be addressed. Primarily, the entire networking interfaces would break down, which is a risk that new architectures are unlikely to take.
It's an ever more global world and "we" should keep in mind that in the next decade most of the world's programming is going to be done in India and China if American corporations have their way.
Certainly, but unrelated to the issue at hand.
Regards, Martin

Tim> [Guido]
>> I would expect that a lot of our code assumes 8-bit characters, and I
>> personally wouldn't mind if Python was limited to such platforms.
Tim> Fine by me too.
Then how about adding
#if UCHAR_MAX != 255
#error "Python's source code currently assumes 8-bit characters."
#endif
right after the HAVE_LIMITS_H test?
Skip

[Skip Montanaro]
Then how about adding
#if UCHAR_MAX != 255
#error "Python's source code currently assumes 8-bit characters."
#endif
right after the HAVE_LIMITS_H test?
I wouldn't object. It should probably then also have
#ifndef UCHAR_MAX
#error ...
#endif
right before it, and stringobject.c's
#if !defined(HAVE_LIMITS_H) && !defined(UCHAR_MAX)
#define UCHAR_MAX 255
#endif
should go away.
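Putting Skip's check and these two points together, a minimal sketch of the resulting block near the HAVE_LIMITS_H test could look like this (wording and placement are illustrative, not the actual Python.h text):

#include <limits.h>

#ifndef UCHAR_MAX
#error "limits.h is required to define UCHAR_MAX"
#endif

#if UCHAR_MAX != 255
#error "Python's source code currently assumes 8-bit characters."
#endif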

>> #if UCHAR_MAX != 255
>> #error "Python's source code currently assumes 8-bit characters."
>> #endif
>>
>> right after the HAVE_LIMITS_H test?
Tim> I wouldn't object. It should probably then also have
Tim> #ifndef UCHAR_MAX
Tim> #error ...
Tim> #endif
Isn't that supposed to always be defined in limits.h or is UCHAR_MAX not a standard macro?
Tim> right before it, and stringobject.c's
Tim> #if !defined(HAVE_LIMITS_H) && !defined(UCHAR_MAX)
Tim> #define UCHAR_MAX 255
Tim> #endif
Tim> should go away.
Sounds like a plan. I'll modify my local source and see if it affects anything.
Skip

[Skip Montanaro]
Isn't that supposed to always be defined in limits.h or is UCHAR_MAX not a standard macro?
Yes, it's supposed to be there. OTOH, so is limits.h, i.e. the HAVE_LIMITS_H test shouldn't be necessary either. So they're just sanity checks. But you're right, if UCHAR_MAX isn't defined, I was thinking the preprocessor would expand
#if UCHAR_MAX != 255
to
#if != 255
and then the error message would be incomprehensible. But unknown names in preprocessor conditionals actually get replaced by 0, and
#if 0 != 255
does just as well.
Sounds like a plan. I'll modify my local source and see if it affects anything.
It should work fine.

>> Sounds like a plan. I'll modify my local source and see if it affects
>> anything.
Tim> It should work fine.
Done. UCHAR_MAX is now required in Python.h, and it's required to be 255. That may open up a couple (small) optimization opportunities (places where gcc 3.3 says explicit tests against char values will always be true or false). I'll leave it for others to make those tweaks if they so desire. Note that some of those gcc messages are probably related to larger types (e.g. unsigned short) or depend on the presence or absence of other macro definitions (e.g. Py_UNICODE_WIDE). Should you wish to code any of these speedups, make sure you get the cpp tests right. Guido won't forgive you if you don't. <wink>
Skip

On Mon, Dec 22, 2003 at 10:50:24AM -0600, Skip Montanaro wrote:
>> Sounds like a plan. I'll modify my local source and see if it affects
>> anything.
Tim> It should work fine.
Done. UCHAR_MAX is now required in Python.h, and it's required to be 255.
Thanks!
That may open up a couple (small) optimization opportunities (places where gcc 3.3 says explicit tests against char values will always be true or false). I'll leave it for others to make those tweaks if they so desire. Note that some of those gcc messages are probably related to larger types (e.g. unsigned short) or depend on the presence or absence of other macro definitions (e.g. Py_UNICODE_WIDE). Should you wish to code any of these speedups, make sure you get the cpp tests right. Guido won't forgive you if you don't. <wink>
I see. Hehe :-)
BTW, I wonder whether we have any good way to get C integer types with an explicitly defined size, such as POSIX's u_int16_t or int64_t. I got a report from an OpenBSD/sparc64 user who wants to change an unsigned int variable used for socket.inet_aton (Modules/socketmodule.c:2887) to int. But I don't think int is appropriate either, for fear that it might be 64 bits on some 64-bit platforms. Also, in_addr_t, the POSIX return type of inet_addr(3), is not very portable. If we had types like u_int32_t, it would be very helpful for coping with this sort of issue.
Hye-Shik

[Hye-Shik Chang]
BTW, I wonder whether we have any good way to get C integer types with explicitly defined size such as u_int16_t or int64_t of POSIX.
We do not, and C doesn't guarantee that any such types exist. For example, while the Crays we recently talked about do support the illusion of 8-bit char, they have no 16-bit or 32-bit integral types (just 8 and 64); that's fine by the C standard.
C99 defines a large pile of names, for things like "exactly 16 bits *if* such a thing exists", "smallest integral type holding at least 16 bits (this must exist)", and "fastest integral type holding at least 16 bits (which also must exist)". Since not all C compilers support these names yet, they can't be used directly in Python. If one is needed, it's intended to be added, in pyport.h, following this comment:
/* typedefs for some C9X-defined synonyms for integral types.
 *
 * The names in Python are exactly the same as the C9X names, except
 * with a Py_ prefix. Until C9X is universally implemented, this is the
 * only way to ensure that Python gets reliable names that don't
 * conflict with names in non-Python code that are playing their own
 * tricks to define the C9X names.
 *
 * NOTE: don't go nuts here! Python has no use for *most* of the C9X
 * integral synonyms. Only define the ones we actually need.
 */
... If we have types like u_int32_t, it will be very helpful to cope with these sort of issues.
Not really -- any use of an "exactly N bits" name will make the code uncompilable on some platform.
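As an illustration of the "at least N bits" style described above, a hypothetical pyport.h-flavoured addition might read as follows (the SIZEOF_INT/SIZEOF_LONG macros are assumed to come from pyconfig.h, and the typedef name is invented for this sketch):

/* Hypothetical sketch: an unsigned type with at least 32 bits. */
#if SIZEOF_INT >= 4
typedef unsigned int  py_uint_least32_sketch_t;
#elif SIZEOF_LONG >= 4
typedef unsigned long py_uint_least32_sketch_t;
#else
#error "no unsigned integral type with at least 32 bits found"
#endif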

"Tim Peters" tim.one@comcast.net writes:
[Skip Montanaro]
Isn't that supposed to always be defined in limits.h or is UCHAR_MAX not a standard macro?
Yes, it's supposed to be there. OTOH, so is limits.h, i.e. the HAVE_LIMITS_H test shouldn't be necessary either.
As Martin regularly points out, we have rather too much of this kind of thing. Given that we demand ANSI C, we could surely lose HAVE_LIMITS_H, and probably much other cruft.
People who have to contend with broken platforms might have a different view. I guess there's no real reason to churn, but when crap like this gets in the way we should strive to kill it.
Cheers, mwh

Michael Hudson wrote:
As Martin regularly points out, we have rather too much of this kind of thing. Given that we demand ANSI C, we could surely lose HAVE_LIMITS_H, and probably much other cruft.
Indeed, and I would encourage contributions in that direction. Make sure you put, in the commit message, an elaboration why it is safe to delete the code in question.
Regards, Martin

[Hye-Shik Chang]
BTW, do we really support architectures with 9-bit chars?
I don't think so. There are assumptions that a char is 8 bits scattered throughout Python's code, not so much in the context of using characters *as* characters, but more indirectly by assuming that the number of *bits* in an object of a non-char type T can be computed as sizeof(T)*8.
Skip's idea of making config smarter about this is a good one, but instead of trying to "fix stuff" for a case that's probably never going to arise, and that can't really be tested anyway until it does, I'd add a block like this everywhere we know we're relying on 8-bit char:
#ifdef HAS_FUNNY_SIZE_CHAR
#error "The following code needs rework when a char isn't 8 bits"
#endif
/* A comment explaining why the following code needs rework
 * when a char isn't 8 bits.
 */
Crays are a red herring here. It's true that some Cray *hardware* can't address anything smaller than 64 bits, and that's also true of some other architectures. char is nevertheless 8 bits on all such 64-bit boxes I know of (and since I worked in a 64-bit world for 15 years, I know about most of them <wink>). On Crays, this is achieved (albeit at major expense) in software: by *software* convention, a pointer-to-char stores the byte offset in the *most*-significant 3 bits of a pointer, and long-winded generated code picks that apart at runtime, loading or storing 8 bytes at a time (the HW can't do less than that), shifting and masking and or'ing to give the illusion of byte addressing for char. Some Alphas do something similar, but that HW's loads and stores simply ignore the last 3 bits of a memory address, and the CPU has special-purpose instructions to help generated code do the subsequent extraction and insertion of 8-bit chunks efficiently and succinctly.
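Loosely, the shift-and-mask trick amounts to something like the following (purely illustrative C, not actual Cray- or Alpha-generated code; the function name is made up):

#include <stddef.h>

/* Illustrative only: fetch the i-th 8-bit "char" from memory that can
 * only be loaded and stored in whole 64-bit words. */
static unsigned char
load_byte(const unsigned long long *mem, size_t byte_index)
{
    unsigned long long word = mem[byte_index / 8];     /* whole-word load */
    unsigned int shift = (unsigned int)(byte_index % 8) * 8;
    return (unsigned char)((word >> shift) & 0xff);    /* shift and mask */
}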

Tim Peters wrote:
[Hye-Shik Chang]
BTW, do we really support architectures with 9-bit chars?
[...]
Skip's idea of making config smarter about this is a good one, but instead of trying to "fix stuff" for a case that's probably never going to arise, and that can't really be tested anyway until it does, I'd add a block like this everywhere we know we're relying on 8-bit char:
#ifdef HAS_FUNNY_SIZE_CHAR
#error "The following code needs rework when a char isn't 8 bits"
#endif
/* A comment explaining why the following code needs rework
 * when a char isn't 8 bits.
 */
This would probably be appropriate for those TI DSP's I mentioned. While they genuinely have a 16-bit char type, they're also intended for use as co-processors, rather than as the main controller for an application. That is, a more standard CPU should be used to handle the general application programming, while the DSP is used for the serious number crunching (that's what it is made for, after all).
Anyone who _thinks_ they want to run Python on the DSP core almost certainly needs to have a long hard think about their system design.
Cheers, Nick.

Hye-Shik Chang wrote:
On Fri, Dec 19, 2003 at 09:30:27AM +0100, M.-A. Lemburg wrote:
perky@users.sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects
In directory sc8-pr-cvs1:/tmp/cvs-serv1651/Objects

Modified Files:
	unicodeobject.c
Log Message:
SF #859573: Reduce compiler warnings on gcc 3.2 and above.

Index: unicodeobject.c
*** 2204,2208 ****
  /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */
! if (size == 1 && *(unsigned char*)s < 256) {
      Py_UNICODE r = *(unsigned char*)s;
      return PyUnicode_FromUnicode(&r, 1);
--- 2212,2216 ----
  /* Latin-1 is equivalent to the first 256 ordinals in Unicode. */
! if (size == 1) {
      Py_UNICODE r = *(unsigned char*)s;
      return PyUnicode_FromUnicode(&r, 1);
This "fix" doesn't look right. Please check.
gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o Objects/unicodeobject.o Objects/unicodeobject.c
Objects/unicodeobject.c: In function `PyUnicodeUCS2_DecodeLatin1':
Objects/unicodeobject.c:2214: warning: comparison is always true due to limited range of data type
AFAIK, *(unsigned char*)s is always smaller than 256. Also decoding latin-1 can be done by just casting it into Py_UNICODE.
You are right. I was thinking that there was some reason we needed this to get Unicode working on Crays, but looking at the CVS log, this was probably just the result of adjusting Martin's single character sharing code to work for Latin-1 rather than just ASCII characters.
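For reference, the "decode Latin-1 by casting" point being agreed on boils down to something like this simplified sketch (the typedef is a stand-in for Py_UNICODE on a UCS-2 build, and the loop is not the actual unicodeobject.c code):

#include <stddef.h>

typedef unsigned short sketch_unicode_t;  /* stand-in for Py_UNICODE (UCS-2) */

/* Every unsigned char value 0..255 is already a valid Latin-1/Unicode
 * ordinal, so no "< 256" range check is needed once chars are 8 bits. */
static void
decode_latin1_sketch(const char *s, size_t size, sketch_unicode_t *out)
{
    size_t i;
    for (i = 0; i < size; i++)
        out[i] = (sketch_unicode_t)(unsigned char)s[i];
}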
participants (12)
- David LeBlanc
- Fredrik Lundh
- Guido van Rossum
- Hye-Shik Chang
- Jack Jansen
- M.-A. Lemburg
- Martin v. Loewis
- martin@v.loewis.de
- Michael Hudson
- Nick Coghlan
- Skip Montanaro
- Tim Peters