[Python-bugs-list] [ python-Bugs-405227 ] sizeof(Py_UNICODE)==2 ????
noreply@sourceforge.net
Thu, 02 Aug 2001 03:15:10 -0700
Bugs item #405227, was opened at 2001-03-01 11:21
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=405227&group_id=5470
Category: Unicode
Group: Platform-specific
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Jon Saenz (jsaenz)
Assigned to: M.-A. Lemburg (lemburg)
Summary: sizeof(Py_UNICODE)==2 ????
Initial Comment:
We are trying to install Python 2.0 on a Cray T3E.
After a painful process of removing several modules
which produce some errors (mmap, sha, md5), we get core
dumps when we run python because under this platform,
there does not exist a 16-bit numeric type. Unsigned
short is 4 bytes long.
We have finally defined unicode objects as unsigned
short, even though they are 4 bytes long, and we have
changed a statement in
Objects/unicodeobject.c
to:
if (sizeof(Py_UNICODE) != sizeof(unsigned short)) {
It compiles and runs now, but the test on regular
expressions crashes and the whole regression test does,
too.
Support of Unicode for this platform is not correct in
version 2.0 of Python.
----------------------------------------------------------------------
>Comment By: M.-A. Lemburg (lemburg)
Date: 2001-08-02 03:15
Message:
Logged In: YES
user_id=38388
This should be fixed now; see Fredrik's response.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-06-26 13:05
Message:
Logged In: YES
user_id=31435
Thank you, /F -- excellent news!
----------------------------------------------------------------------
Comment By: Fredrik Lundh (effbot)
Date: 2001-06-26 12:21
Message:
Logged In: YES
user_id=38376
in the current CVS codebase, there's a new (experimental)
define in Include/unicodeobject.h:
#undef USE_UCS4_STORAGE
if this is defined, Py_UNICODE will be set to the same
thing as Py_UCS4 (usually unsigned int or unsigned long).
currently, basic unicode functions and SRE work just fine
with this setting, but some other modules (including the
UTF-16 codec) may not work (yet).
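
As a rough illustration of the switch described above (not the actual contents of Include/unicodeobject.h), a compile-time define can select the storage type. The demo_ names and DEMO_USE_UCS4_STORAGE are hypothetical stand-ins for Py_UCS4, Py_UNICODE and USE_UCS4_STORAGE, and the choice of unsigned long for the wide type is an assumption:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for Py_UCS4: unsigned long holds at least 32 bits. */
typedef unsigned long demo_ucs4;

/* Uncomment to store every character in the wide type. */
/* #define DEMO_USE_UCS4_STORAGE */

#ifdef DEMO_USE_UCS4_STORAGE
typedef demo_ucs4 demo_unicode;       /* wide storage */
#else
typedef unsigned short demo_unicode;  /* at least 16 bits, possibly more */
#endif

/* Size in bits of one storage unit on this platform. */
static unsigned demo_unit_bits(void)
{
    unsigned bits = (unsigned)(sizeof(demo_unicode) * CHAR_BIT);
    assert(bits >= 16);  /* the C standard guarantees this for both choices */
    return bits;
}
```

Code written against demo_unicode should then avoid assuming that one unit is exactly 2 bytes.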
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-18 06:17
Message:
Logged In: YES
user_id=38388
Of course, you could declare Py_UNICODE as "unsigned int"
and then store Unicode characters in e.g. 4 bytes each on
platforms which don't have a 16-bit integer type.
The reason for being picky about the 16 bits is that we
chose UTF-16 as internal data storage format and that format
defines the byte stream in terms of entities which have 2
bytes for each character. This format provides the best
low-level integration with other Unicode storage formats
such as wchar_t on Windows. That's why I would like to keep
this compatibility if at all possible.
I am not sure, but I think that sre also makes the 2-byte
assumption internally in some places.
A simple test for this would be to define Py_UNICODE as
unsigned long and then run the regression suite...
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2001-06-18 05:38
Message:
Logged In: YES
user_id=6380
Huh? That depends on how ch is declared, and what kind of
data is in the array. If it's an array of Py_UNICODE
elements, and ch is declared as "Py_UNICODE *ch;", then ch++
will do the right thing (increment it by one Py_UNICODE
unit).
Now, the one thing you can NOT assume is that if you read
external 16-bit data into a character buffer, that the
Unicode characters correspond to Py_UNICODE characters --
perhaps this is what you're after?
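
A minimal sketch of the pointer-arithmetic point, assuming a hypothetical wide_unit type that stands in for a Py_UNICODE wider than 16 bits. Incrementing a typed pointer steps by one element, whatever the element size:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wide stand-in for Py_UNICODE. */
typedef unsigned long wide_unit;

/* Count the units before a 0 terminator. */
static size_t count_units(const wide_unit *ch)
{
    size_t n = 0;
    assert(ch != NULL);
    while (*ch != 0) {
        ch++;   /* advances by one wide_unit, i.e. sizeof(wide_unit) bytes */
        n++;
    }
    return n;
}
```

The loop only goes wrong if the buffer really holds packed 16-bit data that was read from outside and is being viewed through a wider pointer type.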
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-18 01:19
Message:
Logged In: YES
user_id=38388
Ok, I agree that the math will probably work in most cases
due to the fact that UTF-16 will never produce values
outside the 16-bit range, but you still have the problem
with iterating over Py_UNICODE arrays: the compiler will
assume that ch++ means to move the pointer by
sizeof(Py_UNICODE) bytes and this breaks in case you use
e.g. a 32-bit integer type for Py_UNICODE.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-06-17 14:05
Message:
Logged In: YES
user_id=31435
The code snippet there will work fine with any integral
type >= 2 bytes if you just add the line
ch &= 0xffff;
between the computation and the "if".
It will actually work fine even if you *don't* put in that
mask, but deducing that required analysis of the specific
operations (you shift 4 bits left 12, 6 bits left 6 so they
don't overlap with the first chunk and so the "+" can't
cause a carry, and then add another chunk of
non-overlapping 6 bits, so again there's no carry, and
therefore the infinite-precision result fits in no more
than 16 bits, and so there's no need to mask).
About pointers, I don't see a problem there either, unless
you're casting a Py_UNICODE* to a char* then adding a
hardcoded 2.
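
The no-carry argument can be checked with a small sketch. It assumes the computation is done in unsigned long (any type of at least 16 bits would do); decode_utf8_3 is a hypothetical helper name, and the validity checks from the original snippet are left out:

```c
#include <assert.h>

/* Decode the payload bits of a 3-byte UTF-8 sequence in a wide type.
   Only the bit arithmetic is shown; error handling is omitted. */
static unsigned long decode_utf8_3(const unsigned char *s)
{
    unsigned long ch = ((unsigned long)(s[0] & 0x0f) << 12)   /* bits 12-15 */
                     + ((unsigned long)(s[1] & 0x3f) << 6)    /* bits 6-11  */
                     + (unsigned long)(s[2] & 0x3f);          /* bits 0-5   */
    /* The three chunks never overlap, so masking changes nothing. */
    assert((ch & 0xffffUL) == ch);
    return ch;
}
```

For example, the bytes E2 82 AC decode to 0x20AC, and the largest input EF BF BF decodes to 0xFFFF, which still fits in 16 bits.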
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-17 12:57
Message:
Logged In: YES
user_id=38388
The codecs are full of things like:
    ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
    if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000)) {
        errmsg = "illegal encoding";
        goto utf8Error;
    }
where ch is a Py_UNICODE character.
The other "problem" is that pointer dereferencing is used a
lot in the code (using arrays of Py_UNICODE chars). We could
probably shift the calculations to Py_UCS4 integers and then
only do the data buffer access with Py_UNICODE which would
then be mapped to a 2-char-array to get the data buffer
layout right.
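
One way this idea might look, as a sketch only: the calculations happen in a wide integer and the buffer is touched through explicit load and store helpers. The unit16 type, the helper names, the big-endian byte order and the 8-bit char are all assumptions made for illustration:

```c
#include <assert.h>

/* Hypothetical 2-char storage unit; assumes 8-bit chars.
   The big-endian byte order is an arbitrary choice for this sketch. */
typedef struct { unsigned char b[2]; } unit16;

/* Store a value that is known to fit in 16 bits. */
static void unit16_store(unit16 *u, unsigned long ch)
{
    assert(ch <= 0xFFFFUL);
    u->b[0] = (unsigned char)((ch >> 8) & 0xFF);  /* high byte */
    u->b[1] = (unsigned char)(ch & 0xFF);         /* low byte  */
}

/* Load the value back into a wide integer for calculations. */
static unsigned long unit16_load(const unit16 *u)
{
    return ((unsigned long)u->b[0] << 8) | (unsigned long)u->b[1];
}
```

The cost is that every buffer access goes through a helper instead of a plain pointer dereference.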
Still, I think this is low priority. Patches are welcome of
course :-)
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-06-17 12:44
Message:
Logged In: YES
user_id=31435
Point me to one of the calculations that's thought to be a
problem, and happy to suggest something (I didn't find one
on my own, but I'm not familiar with the details here).
BTW, I reopened this because we got another report of T3E
woes on c.l.py that day.
You certainly need at least 16 bits, but it's hard to see
how having more than that could be a genuine problem -- at
worst "this kind of thing" usually requires no more than
masking with 0xffff at the end. That can be hidden in a
macro that's a nop on platforms that don't need it, if
micro-efficiency is a concern.
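
A sketch of such a macro, assuming unsigned short is the storage type. UNIT_MASK, unit_t and to_unit are hypothetical names, not anything from the Python sources:

```c
#include <assert.h>
#include <limits.h>

typedef unsigned short unit_t;  /* hypothetical storage type */

#if USHRT_MAX == 0xFFFFU
/* unsigned short is exactly 16 bits: the conversion already wraps,
   so no explicit AND is compiled in. */
#define UNIT_MASK(x) ((unit_t)(x))
#else
/* Wider unsigned short (as reported for the T3E): mask explicitly. */
#define UNIT_MASK(x) ((unit_t)((x) & 0xFFFFU))
#endif

/* Reduce a wide intermediate result to a 16-bit value. */
static unit_t to_unit(unsigned long v)
{
    unit_t u = UNIT_MASK(v);
    assert((unsigned long)u <= 0xFFFFUL);
    return u;
}
```

Either branch yields the low 16 bits of the argument, so calling code does not need to know which one was selected.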
Often even that isn't needed. For example, binascii_crc32
absolutely must compute a 32-bit checksum, but works fine
on platforms with 8-byte longs. The only "trick" needed to
make that work was to compute the complement via
crc ^ 0xFFFFFFFFUL
instead of via
~crc
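
To illustrate the difference, here is a sketch with a hypothetical helper, complement32 (this is not the binascii code itself). On a platform with 8-byte longs, ~crc would also set the upper 32 bits, while the XOR flips only the low 32:

```c
#include <assert.h>

/* Complement the low 32 bits of crc. Any higher bits of a wider
   unsigned long stay zero, unlike with ~crc. */
static unsigned long complement32(unsigned long crc)
{
    assert(crc <= 0xFFFFFFFFUL);  /* input is expected to be a 32-bit value */
    return crc ^ 0xFFFFFFFFUL;
}
```

With 4-byte longs both forms give the same result, so the XOR version is the portable one.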
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-17 11:47
Message:
Logged In: YES
user_id=38388
It may be a design error, but getting this right for all
platforms is hard and by choosing the 16-bit type we managed
to handle 95% of all platforms in a fast and reliable way.
Any idea how we could "emulate" a 16-bit integer type? We
need the integer type because we do calculations on the
values.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-06-13 22:28
Message:
Logged In: YES
user_id=31435
I opened this again. It's simply unacceptable to require
that the platform have a 2-byte integer type. That doesn't
mean it's easy to fix, but it's a design error all the same.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2001-03-16 11:27
Message:
Logged In: YES
user_id=38388
The current Unicode implementation needs Py_UNICODE to
be a 16-bit entity and so does SRE.
To get this to work on the Cray, you could try to use a
2-char struct which is then cast to a short in all those
places which assume a 16-bit number representation.
Simply using a 4-byte entity as basis will not work, since
the fact that Py_UNICODE fits into 2 bytes is hard-coded
into the implementation in a number of places.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-03-01 15:29
Message:
Logged In: YES
user_id=31435
Notes:
+ Python was ported to T3E last year, IIRC by Marc Poinot.
May want to track him down.
+ Python's Unicode support doesn't rely on any platform
Unicode support. Whether it's "useless" depends on the
user, not the platform.
+ Face it <wink>: Crays are the only platforms that don't
have a native 16-bit integer type.
+ Even so, I believe at least SRE is happy to work with
32-bit Unicode (glibc's wchar_t is 4 bytes, IIRC), so that
much was likely a shallow problem.
----------------------------------------------------------------------
Comment By: Jon Saenz (jsaenz)
Date: 2001-03-01 15:09
Message:
Logged In: YES
user_id=12122
We have finally given up on installing Python on the Cray T3E
because of its lack of support for shared objects. Static
linking makes it difficult to load external libraries
(Numeric, Lapack, and so on).
In any case, we still think that this "bug" should be
repaired. There may be other platforms which:
1) Do not support Unicode, so that the Unicode feature of
Python is useless in these cases.
2) Have users interested in running Python on them (for
Numeric applications, for instance).
3) May not have a 16-bit native numerical type.
Under these circumstances, the current version of Python
cannot be used.
----------------------------------------------------------------------
Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2001-03-01 14:05
Message:
Logged In: YES
user_id=3066
Marc-Andre, can you deal with the general Unicode issues here and then pass this along to Fredrik for SRE updates? (Or better yet, work in parallel?)
Thanks!
----------------------------------------------------------------------