[Python-bugs-list] [ python-Bugs-405227 ] sizeof(Py_UNICODE)==2 ????

Thu, 02 Aug 2001 03:15:10 -0700

Bugs item #405227, was opened at 2001-03-01 11:21
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=405227&group_id=5470

Category: Unicode
Group: Platform-specific
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Jon Saenz (jsaenz)
Assigned to: M.-A. Lemburg (lemburg)
Summary: sizeof(Py_UNICODE)==2 ????

Initial Comment:
We are trying to install Python 2.0 on a Cray T3E.

After a painful process of removing several modules
which produce errors (mmap, sha, md5), we get core
dumps when we run Python because this platform has
no 16-bit numeric type. Unsigned short is 4 bytes long.

We have finally defined Unicode objects as unsigned
short, even though it is 4 bytes long, and we have
changed a statement in
Objects/unicodeobject.c
to:
if (sizeof(Py_UNICODE) != sizeof(unsigned short)) {

It compiles and runs now, but the regular expression
test crashes, and so does the whole regression test.

Support of Unicode for this platform is not correct in
version 2.0 of Python.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2001-08-02 03:15

Message:
Logged In: YES 
user_id=38388

This should be fixed now; see Fredrik's response.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-06-26 13:05

Message:
Logged In: YES 
user_id=31435

Thank you, /F -- excellent news!

----------------------------------------------------------------------

Comment By: Fredrik Lundh (effbot)
Date: 2001-06-26 12:21

Message:
Logged In: YES 
user_id=38376

in the current CVS codebase, there's a new (experimental) 
define in Include/unicodeobject.h:

    #undef USE_UCS4_STORAGE

if this is defined, Py_UNICODE will be set to the same 
thing as Py_UCS4 (usually unsigned int or unsigned long).  
currently, basic unicode functions and SRE work just fine 
with this setting, but some other modules (including the 
UTF-16 codec) may not work (yet).
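
A rough sketch of how a switch like this might select the storage type. This is an illustration only, not the actual contents of Include/unicodeobject.h: the names DEMO_USE_UCS4_STORAGE, demo_ucs4, demo_unicode and demo_unicode_bits are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical sketch, not the real Include/unicodeobject.h.
   Remove this define to fall back to the narrow storage type. */
#define DEMO_USE_UCS4_STORAGE

/* Pick an unsigned type with at least 32 bits for full code points. */
#if UINT_MAX >= 0xFFFFFFFFUL
typedef unsigned int demo_ucs4;
#else
typedef unsigned long demo_ucs4;
#endif

#ifdef DEMO_USE_UCS4_STORAGE
typedef demo_ucs4 demo_unicode;       /* one wide element per character */
#else
typedef unsigned short demo_unicode;  /* unsigned short holds >= 16 bits */
#endif

/* Number of storage bits in one element of the chosen type. */
static unsigned demo_unicode_bits(void)
{
    return (unsigned)(sizeof(demo_unicode) * CHAR_BIT);
}
```

With the define in place, demo_unicode_bits() reports at least 32 on any conforming compiler, so no exact 16-bit type is needed.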

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-18 06:17

Message:
Logged In: YES 
user_id=38388

Of course, you could declare Py_UNICODE as "unsigned int"
and then store Unicode characters in e.g. 4 bytes each on
platforms which don't have a 16-bit integer type. 

The reason for being picky about the 16 bits is that we
chose UTF-16 as internal data storage format and that format
defines the byte stream in terms of entities which have 2
bytes for each character. This format provides the best
low-level integration with other Unicode storage formats
such as wchar_t on Windows. That's why I would like to keep
this compatibility if at all possible.

I am not sure, but I think that sre also makes the 2-byte
assumption internally in some places.

A simple test for this would be to define Py_UNICODE as
unsigned long and then run the regression suite...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-06-18 05:38

Message:
Logged In: YES 
user_id=6380

Huh?  That depends on how ch is declared, and what kind of
data is in the array.  If it's an array of Py_UNICODE
elements, and ch is declared as "Py_UNICODE *ch;", then ch++
will do the right thing (increment it by one Py_UNICODE
unit).

Now, the one thing you can NOT assume is that if you read
external 16-bit data into a character buffer, that the
Unicode characters correspond to Py_UNICODE characters --
perhaps this is what you're after?
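
Both points can be sketched in a few lines of C. The type demo_unicode and the two helper functions are hypothetical stand-ins, and the external data is assumed to be big-endian 16-bit units stored in 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Py_UNICODE on a platform without a 16-bit type. */
typedef unsigned long demo_unicode;

/* Count elements up to a 0 terminator. p++ advances by one element,
   that is by sizeof(demo_unicode) bytes, whatever that size is. */
static size_t demo_length(const demo_unicode *p)
{
    const demo_unicode *start = p;
    while (*p != 0)
        p++;
    return (size_t)(p - start);
}

/* Widen external big-endian 16-bit units into demo_unicode elements.
   The raw buffer must be converted unit by unit; casting it to
   demo_unicode* would read the wrong number of bytes per character. */
static void demo_from_utf16be(const unsigned char *raw, size_t nunits,
                              demo_unicode *out)
{
    size_t i;
    assert(raw != NULL && out != NULL);
    for (i = 0; i < nunits; i++)
        out[i] = ((demo_unicode)raw[2 * i] << 8) | raw[2 * i + 1];
}
```

The iteration works unchanged for any element width; only the step that reads external 16-bit data has to know about the difference in size.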

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-18 01:19

Message:
Logged In: YES 
user_id=38388

Ok, I agree that the math will probably work in most cases
due to the fact that UTF-16 will never produce values
outside the 16-bit range, but you still have the problem
with iterating over Py_UNICODE arrays: the compiler will
assume that ch++ means to move the pointer by
sizeof(Py_UNICODE) bytes and this breaks in case you use
e.g. a 32-bit integer type for Py_UNICODE.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-06-17 14:05

Message:
Logged In: YES 
user_id=31435

The code snippet there will work fine with any integral 
type >= 2 bytes if you just add the line

ch &= 0xffff;

between the computation and the "if".

It will actually work fine even if you *don't* put in that 
mask, but deducing that required analysis of the specific 
operations (you shift 4 bits left 12, 6 bits left 6 so they 
don't overlap with the first chunk and so the "+" can't 
cause a carry, and then add another chunk of non-
overlapping 6 bits, so again there's no carry, and 
therefore the infinite-precision result fits in no more 
than 16 bits, and so there's no need to mask).

About pointers, I don't see a problem there either, unless 
you're casting a Py_UNICODE* to a char* then adding a 
hardcoded 2.
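
The no-carry argument can be checked with a small sketch. It assumes a wide unsigned type in place of Py_UNICODE; the names demo_unicode and demo_decode3 are hypothetical, and validation of the lead and continuation bytes is left out.

```c
#include <assert.h>

/* Hypothetical wide stand-in for Py_UNICODE (at least 32 bits). */
typedef unsigned long demo_unicode;

/* Combine the payload bits of a 3-byte UTF-8 sequence, following the
   shape of the quoted codec snippet. */
static demo_unicode demo_decode3(const unsigned char *s)
{
    demo_unicode ch = ((demo_unicode)(s[0] & 0x0f) << 12)  /* bits 12..15 */
                    + ((demo_unicode)(s[1] & 0x3f) << 6)   /* bits 6..11  */
                    + (demo_unicode)(s[2] & 0x3f);         /* bits 0..5   */
    /* The three fields do not overlap, so the additions cannot carry
       and the result already fits in 16 bits: the mask changes nothing. */
    assert((ch & 0xffff) == ch);
    return ch;
}
```

Even with every payload bit set, the result is 0xf000 + 0x0fc0 + 0x003f = 0xffff, which is the largest value the expression can produce.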

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-17 12:57

Message:
Logged In: YES 
user_id=38388

The codecs are full of things like:

            ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6)
                 + (s[2] & 0x3f);
            if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000)) {
                errmsg = "illegal encoding";
                goto utf8Error;
            }

where ch is a Py_UNICODE character.

The other "problem" is that pointer dereferencing is used a
lot in the code (using arrays of Py_UNICODE chars). We could
probably shift the calculations to Py_UCS4 integers and then
only do the data buffer access with Py_UNICODE which would
then be mapped to a 2-char array to get the data buffer
layout right.

Still, I think this is low priority. Patches are welcome of
course :-)

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-06-17 12:44

Message:
Logged In: YES 
user_id=31435

Point me to one of the calculations that's thought to be a 
problem, and happy to suggest something (I didn't find one 
on my own, but I'm not familiar with the details here).  
BTW, I reopened this because we got another report of T3E 
woes on c.l.py that day.

You certainly need at least 16 bits, but it's hard to see 
how having more than that could be a genuine problem -- at 
worst "this kind of thing" usually requires no more than 
masking with 0xffff at the end.  That can be hidden in a 
macro that's a nop on platforms that don't need it, if 
micro-efficiency is a concern.
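
One possible shape for such a macro, as a sketch only. DEMO_MASK16 and demo_store16 are hypothetical names, and the no-op branch assumes the value is then stored in an unsigned short, which on those platforms truncates to 16 bits by itself.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical macro: mask to 16 bits only where unsigned short is wider. */
#if USHRT_MAX == 0xFFFF
#define DEMO_MASK16(x) (x)               /* 16-bit short: nothing to do */
#else
#define DEMO_MASK16(x) ((x) & 0xFFFFUL)  /* wider short: mask explicitly */
#endif

/* Store a computed value into the (possibly wider) 16-bit storage type. */
static unsigned short demo_store16(unsigned long value)
{
    return (unsigned short)DEMO_MASK16(value);
}
```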

Often even that isn't needed.  For example, binascii_crc32 
absolutely must compute a 32-bit checksum, but works fine 
on platforms with 8-byte longs.  The only "trick" needed to 
make that work was to compute the complement via

crc ^ 0xFFFFFFFFUL

instead of via

~crc
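
A small sketch of the difference, assuming the checksum is kept in an unsigned long. The function names are hypothetical and are not taken from the binascii source.

```c
#include <assert.h>

/* Complement restricted to the low 32 bits. The result is the same
   whether unsigned long is 32 or 64 bits wide. */
static unsigned long demo_complement32(unsigned long crc)
{
    assert(crc <= 0xFFFFFFFFUL);  /* input is expected to be a 32-bit value */
    return crc ^ 0xFFFFFFFFUL;
}

/* Plain complement flips every bit of unsigned long, so with 8-byte
   longs the upper 32 bits of the result end up set. */
static unsigned long demo_complement_all(unsigned long crc)
{
    return ~crc;
}
```

The two functions agree in the low 32 bits on every platform; they differ only where unsigned long has more than 32 bits.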

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-06-17 11:47

Message:
Logged In: YES 
user_id=38388

It may be a design error, but getting this right for all
platforms is hard and by choosing the 16-bit type we managed
to handle 95% of all platforms in a fast and reliable way.

Any idea how we could "emulate" a 16-bit integer type? We
need the integer type because we do calculations on the
values.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-06-13 22:28

Message:
Logged In: YES 
user_id=31435

I opened this again.  It's simply unacceptable to require 
that the platform have a 2-byte integer type.  That doesn't 
mean it's easy to fix, but it's a design error all the same.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-03-16 11:27

Message:
Logged In: YES 
user_id=38388

The current Unicode implementation needs Py_UNICODE to
be a 16-bit entity and so does SRE.

To get this to work on the Cray, you could try to use a
2-char struct which is then cast to a short in all those
places which assume a 16-bit number representation.

Simply using a 4-byte entity as basis will not work, since
the fact that Py_UNICODE fits into 2 bytes is hard-coded
into the implementation in a number of places.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-03-01 15:29

Message:
Logged In: YES 
user_id=31435

Notes:

+ Python was ported to T3E last year, IIRC by Marc Poinot.  
May want to track him down.

+ Python's Unicode support doesn't rely on any platform 
Unicode support.  Whether it's "useless" depends on the 
user, not the platform.

+ Face it <wink>:  Crays are the only platforms that don't 
have a native 16-bit integer type.

+ Even so, I believe at least SRE is happy to work with 32-
bit Unicode (glibc's wchar_t is 4 bytes, IIRC), so that 
much was likely a shallow problem.

----------------------------------------------------------------------

Comment By: Jon Saenz (jsaenz)
Date: 2001-03-01 15:09

Message:
Logged In: YES 
user_id=12122

We have finally given up on installing Python on the Cray T3E
because it does not support shared objects. Static linking
makes it difficult to load the various external libraries
(Numeric, Lapack, and so on).

In any case, we still think that this "bug" should be
repaired. There may be other platforms which:
1) Do not support Unicode, so that the Unicode feature of
Python is useless on them.
2) Are of interest to users who want to run Python on them
(for Numeric applications, for instance).
3) Do not have a native 16-bit numeric type.

Under these circumstances, the current version of Python
cannot be used.

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2001-03-01 14:05

Message:
Logged In: YES 
user_id=3066

Marc-Andre, can you deal with the general Unicode issues here and then pass this along to Fredrik for SRE updates?  (Or better yet, work in parallel?)

Thanks!

----------------------------------------------------------------------
