[I18n-sig] UCS-4 configuration
Tim Peters
tim.one@home.com
Wed, 27 Jun 2001 04:24:44 -0400
[Martin v. Loewis]
> I would never remotely consider questioning your authority, how could I?
LOL! If authority were of any help in getting software to work, Guido
wouldn't need any of us: he could just scowl at it, and it would all fall
into place <wink>.
> The specific code in question is in PyUnicode_DecodeUTF16. It gets a
> char*, and converts it to a Py_UCS2* (where Py_UCS2 is unsigned short).
> It then fetches one Py_UCS2 after another, byte-swapping if appropriate,
> and advances the Py_UCS2* by one. The intention is that this retrieves
> the bytes of the input in pairs.
>
> Is that code correct even if sizeof(unsigned short)>2?
Oh no. Clearly, if sizeof(Py_UCS2) > 2, it will read more than 2 bytes each
time. But the *obvious* way to read two bytes is to use a char* pointer!
Say q and e were declared
const unsigned char*
instead of Py_UCS2*. Then for big-endian getting "the next" char is just
ch = (q[0] << 8) | q[1];
q += 2;
and swap "0" and "1" for a little-endian machine. The code would get
substantially simpler. In fact, you can skip all the embedded #ifdefs and
repeated (bo == 1), (bo == -1) tests by setting up invariants
int lo_index, hi_index;
appropriately at the start before the loop-- setting one of those to 1 and
the other to 0 --and then do
ch = (q[hi_index] << 8) | q[lo_index];
q += 2;
unconditionally inside the loop whenever fetching another pair. Now C
doesn't guarantee that a byte is 8 bits either, but that's one thing that's
true even on a Cray (they actually read 64 bits under the covers and
shift+mask, but it looks like "8 bits" to C code); I don't know of any
modern box on which it isn't true, and it's exceedingly unlikely any new
architecture won't play along.
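A minimal sketch of the byte-at-a-time loop described above, assuming 8-bit bytes. The function name decode_utf16_units, its parameters, and the byteorder convention (1 for big-endian input, -1 for little-endian input) are hypothetical and chosen for illustration; this is not the actual PyUnicode_DecodeUTF16 code, and it does no surrogate or BOM handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reads 16-bit code units from a byte buffer,
   two bytes at a time, regardless of sizeof(unsigned short).
   byteorder: 1 = big-endian input, -1 = little-endian input (assumed
   convention). Returns the number of units written to out. */
size_t decode_utf16_units(const unsigned char *q, size_t nbytes,
                          int byteorder, unsigned long *out)
{
    int hi_index, lo_index;
    /* A trailing odd byte, if any, is ignored in this sketch. */
    const unsigned char *e = q + (nbytes & ~(size_t)1);
    size_t n = 0;

    assert(byteorder == 1 || byteorder == -1);

    /* Set the invariants once, before the loop. */
    hi_index = (byteorder == 1) ? 0 : 1;
    lo_index = 1 - hi_index;

    while (q < e) {
        /* No #ifdefs and no per-unit byte-order tests in the loop. */
        out[n++] = ((unsigned long)q[hi_index] << 8) | q[lo_index];
        q += 2;
    }
    return n;
}
```

Feeding it the bytes 0x12 0x34 with byteorder 1 yields the unit 0x1234; the same bytes with byteorder -1 yield 0x3412.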
Everything else should "just work" then. BTW, the existing byte-swapping
code doesn't work right either for sizeof(Py_UCS2) > 2, because in
ch = (ch >> 8) | (ch << 8);
there's an assumption that the left shift is end-off, i.e. that the bits
shifted past bit 15 just vanish. Fetch a byte at a
time as above and none of that fiddling is needed. Else the existing
byte-swapping code needs either
ch &= 0xffff;
after, or
ch = (ch >> 8) | ((ch & 0xff) << 8);
in the body. But we'd be better off getting rid of Py_UCS2 thingies
entirely in this routine (they don't *mean* "UCS2", they *mean* "exactly two
bytes", and that can't always be met).
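To make the swapping problem concrete, here is a small sketch that uses unsigned long (at least 32 bits in C) as a hypothetical stand-in for a Py_UCS2 wider than two bytes. The type name wide_unit and the three function names are made up for illustration; each function assumes its argument holds a 16-bit value.

```c
#include <assert.h>

/* Hypothetical stand-in for a Py_UCS2 that is wider than 16 bits. */
typedef unsigned long wide_unit;

/* The existing swap: the high byte is shifted up past bit 15 and stays. */
wide_unit swap_naive(wide_unit ch)
{
    assert(ch <= 0xffffUL);
    return (ch >> 8) | (ch << 8);
}

/* First repair: mask down to 16 bits after swapping. */
wide_unit swap_mask_after(wide_unit ch)
{
    assert(ch <= 0xffffUL);
    ch = (ch >> 8) | (ch << 8);
    ch &= 0xffff;
    return ch;
}

/* Second repair: mask the low byte before shifting it up. */
wide_unit swap_mask_inside(wide_unit ch)
{
    assert(ch <= 0xffffUL);
    return (ch >> 8) | ((ch & 0xff) << 8);
}
```

With the input 0x1234, swap_naive returns 0x123412, while both repaired versions return 0x3412.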