
Hi everybody,

I've just uploaded a new Unicode snapshot. It includes a brand new UTF-16 codec which is BOM aware, meaning that it recognizes BOMs on input and adjusts the byte order accordingly. On output you can choose to have a BOM written or explicitly define a byte order to use.

Also new in this snapshot is configuration code which figures out the byte order on the installation machine... I looked everywhere in the Python source code but couldn't find any hint that this was already done some place, so I simply added some autoconf magic to have two new symbols defined: BYTEORDER_IS_LITTLE_ENDIAN and BYTEORDER_IS_BIG_ENDIAN (mutually exclusive, of course).

BTW, I changed the hash method of Unicode objects to use the UTF-8 string as the basis for the hash code. This means that u'abc' and 'abc' will now be treated as the same dictionary key!

Some documentation also made it into the snapshot. See the file Misc/unicode.txt for all the interesting details about the implementation. Note that the web page provides a prepatched version of the interpreter for your convenience... just download it, run ./configure and make, and you're done.

Could someone with access to a MS VC compiler please update the project files and perhaps post me some feedback about any glitches?! I have never compiled Python on Windows myself and don't have the time to figure it out just now :-/. Thanks :-)

--
Marc-Andre Lemburg
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
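
[For readers unfamiliar with how such a BOM-aware UTF-16 codec decides on a byte order: a UTF-16 stream may begin with the byte order mark U+FEFF, which appears on the wire as FF FE on little endian machines and FE FF on big endian machines. The following is only a minimal sketch of that detection step; the function and enum names are invented for illustration and this is not the code from the snapshot:

    #include <stddef.h>

    enum byteorder { BO_NATIVE, BO_LITTLE_ENDIAN, BO_BIG_ENDIAN };

    /* Inspect the first two bytes of a UTF-16 stream.  Return the byte
       order indicated by a BOM, or BO_NATIVE if no BOM is present.
       *consumed tells the caller how many bytes to skip. */
    static enum byteorder
    detect_utf16_byteorder(const unsigned char *s, size_t size, size_t *consumed)
    {
        *consumed = 0;
        if (size >= 2) {
            if (s[0] == 0xFF && s[1] == 0xFE) {
                *consumed = 2;
                return BO_LITTLE_ENDIAN;
            }
            if (s[0] == 0xFE && s[1] == 0xFF) {
                *consumed = 2;
                return BO_BIG_ENDIAN;
            }
        }
        return BO_NATIVE;
    }

On encoding, the same U+FEFF code point is simply written first in whichever byte order was chosen.]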

Tim Peters wrote:
otoh, figuring out the byte order is one of the things autoconf does very well. if they're not there already, Python's autoconf should include the basic "platform metrics" macros:

    AC_HEADER_STDC
    AC_C_INLINE
    AC_C_BIGENDIAN
    AC_CHECK_SIZEOF(char)
    AC_CHECK_SIZEOF(short)
    AC_CHECK_SIZEOF(int)
    AC_CHECK_SIZEOF(long)
    AC_CHECK_SIZEOF(float)
    AC_CHECK_SIZEOF(double)
    AC_C_CONST

(think "extension writers", not necessarily "python core")

</F>
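
[These checks end up as preprocessor defines in the generated config header: AC_C_BIGENDIAN defines WORDS_BIGENDIAN on big endian hosts, AC_CHECK_SIZEOF(short) defines SIZEOF_SHORT, and so on. The snippet below is just a rough illustration of how an extension writer could consume them, for example when bringing UTF-16 units into native order; it is not taken from the Python or PIL sources:

    #include <stddef.h>
    #include "config.h"   /* assumed to define WORDS_BIGENDIAN / SIZEOF_SHORT */

    #if defined(SIZEOF_SHORT) && SIZEOF_SHORT != 2
    #error "this sketch assumes 2-byte shorts"
    #endif

    /* Byte-swap a buffer of 16-bit units in place when the data's byte
       order differs from the host's.  data_is_big_endian is nonzero if
       the input data is big endian. */
    static void
    swap_to_native(unsigned short *buf, size_t n, int data_is_big_endian)
    {
    #ifdef WORDS_BIGENDIAN
        int host_is_big_endian = 1;
    #else
        int host_is_big_endian = 0;
    #endif
        if (data_is_big_endian != host_is_big_endian) {
            size_t i;
            for (i = 0; i < n; i++)
                buf[i] = (unsigned short)((buf[i] >> 8) | (buf[i] << 8));
        }
    }
]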

Fredrik Lundh wrote:
Should I add these, Guido? I'd rather stick with predefined macros than cook my own.

AC_C_INLINE would be especially interesting here: I think this could be used a lot for those tiny functions which just apply a type check and then return some object attribute value.

AC_C_CONST frightens me a bit: the Unicode code uses "const" a lot to make sure compilers can do the right optimizations. Are there compilers out there which do not handle "const" correctly?

--
Marc-Andre Lemburg
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
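
[The kind of helper being described is a one-liner that checks an object's type and hands back a field. A minimal sketch, with an invented stand-in type rather than anything from the Unicode implementation; the "inline" keyword here relies on AC_C_INLINE having defined it appropriately (possibly to nothing) for the compiler in use:

    #include <stddef.h>

    typedef struct {
        int    type_tag;     /* stand-in for a real type check */
        size_t length;
    } ExampleObject;

    #define EXAMPLE_TYPE_TAG 0x4242

    /* Tiny accessor: verify the type, then return an attribute value. */
    static inline size_t
    example_length(const ExampleObject *obj)
    {
        if (obj == NULL || obj->type_tag != EXAMPLE_TYPE_TAG)
            return (size_t)-1;   /* real code would raise an error instead */
        return obj->length;
    }
]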

M.-A. Lemburg wrote:
Should I add these, Guido? I'd rather stick with predefined macros than cook my own.
AC_C_INLINE would be especially interesting here:
umm. since inline isn't really part of ANSI C, that means that you'll end up having possibly non-inlined code in header files, right? (I use inline aggressively inside modules, except for really critical things that absolutely definitely must be inlined -- look in PIL to see what I mean...)
not sure about this; I just copied the list from PIL, and should probably have left this one out. I don't think I've ever used it, and afaik, 1.6 will no longer support non-ANSI compilers anyway... </F>

Fredrik Lundh wrote:
Hmm, it would probably cause code to go into header files -- not really good style but perhaps C++ has leveraged this a bit recently ;-)
Uff, glad you said that :-)

BTW, has anyone tried to compile the Unicode stuff on Windows yet?

--
Marc-Andre Lemburg
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Tim Peters wrote:
I looked there, but only found that it uses native byte order by means of "letting the compiler do the right thing" -- there doesn't seem to be any code which actually tests for it.

The autoconf stuff is pretty simple, BTW. The following code is used for the test:

    #include <string.h>
    #include <stdlib.h>

    int main()
    {
        long x = 0x34333231;   /* == "1234" on little endian machines */
        char *y = (char *)&x;
        if (strncmp(y, "1234", 4))
            exit(0);           /* big endian */
        else
            exit(1);           /* little endian */
    }

This should be ok on big endian machines... even though I haven't tested it due to lack of access to such a beast.

--
Marc-Andre Lemburg
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

[M.-A. Lemburg]
[Tim]
There's a tiny bit of inline code for this in the "host byte order" case of structmodule.c's function whichtable. ...
[MAL]
Here's the "tiny bit of (etc)": int n = 1; char *p = (char *) &n; if (*p == 1) ...
No, no, no -- that's one "no" for each distinct way I know of that can fail on platforms where sizeof(long) == 8 <wink>. Don't *ever* use longs to test endianness; besides the obvious problems, it also sucks you into illusions unique to "mixed endian" architectures. "ints" are iffy too, but less so. Test what you're actually concerned about, as directly and simply as possible; e.g., if you're actually concerned about how the machine stores shorts, do what structmodule does but use a short instead of an int. And if it's important, explicitly verify that sizeof(short)==2 (& raise an error if it's not).
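
[A sketch of the kind of direct test Tim is describing -- probe a short and verify its size first. This is only an illustration, not code from structmodule.c:

    #include <stdio.h>

    int main(void)
    {
        short n = 1;
        char *p = (char *)&n;

        if (sizeof(short) != 2) {
            fprintf(stderr, "expected 2-byte shorts, got %u bytes\n",
                    (unsigned)sizeof(short));
            return 2;
        }
        if (*p == 1)
            printf("shorts are stored little endian\n");
        else
            printf("shorts are stored big endian\n");
        return 0;
    }
]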

Tim Peters wrote:
Hmm, I hadn't noticed that one (but Jean posted the same idea in private mail ;).
I've turned to the predefined standard autoconf macro, as suggested by Fredrik. It does the above plus some other magic to find out the endianness. On big endian machines the configure script now defines WORDS_BIGENDIAN.

The sizeof(Py_UNICODE)==2 assertion is currently tested at init time of the Unicode implementation.

I would like to add Fredrik's proposed sizeof checks to the configure script too, but there's a catch: the config.h in PC/ is hand generated and would need some updates for the various PC targets. Any volunteers? We'd need the following extra data:

    /* The number of bytes in a char. */
    #define SIZEOF_CHAR 1

    /* The number of bytes in a double. */
    #define SIZEOF_DOUBLE 8

    /* The number of bytes in a float. */
    #define SIZEOF_FLOAT 4

    /* The number of bytes in a short. */
    #define SIZEOF_SHORT 2

plus maybe

    /* Endianness.  PCs are usually little endian, so we don't
       define this here... */
    /* #undef WORDS_BIGENDIAN */

--
Marc-Andre Lemburg
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
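
[For readers wondering what the init-time assertion amounts to: something along these lines, using the public Py_FatalError call. The function name is invented for illustration and the actual check in the snapshot may look different:

    #include "Python.h"

    /* Refuse to run if Py_UNICODE does not have the expected 16-bit size. */
    static void
    _check_unicode_config(void)
    {
        if (sizeof(Py_UNICODE) != 2)
            Py_FatalError("sizeof(Py_UNICODE) != 2: check your config");
    }
]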
