[Numpy-discussion] Extent of unicode types in numpy
Travis Oliphant
oliphant at ee.byu.edu
Mon Feb 6 14:16:02 EST 2006
Francesc Altet wrote:
>Hi,
>
>I'm a bit surprised by the fact that unicode types are the only ones
>that break the rule: they must be specified with a different number of
>bytes than they actually take. For example:
>
>
Right now, the array protocol typestring is a little ambiguous about
Unicode characters. Ideally, the array interface would describe what
kind of Unicode characters are being dealt with, so that 2-byte and
4-byte Unicode characters have different descriptions in the typestring.
Python can be compiled with Unicode as either 2-byte or 4-byte. The
'U#' descriptor is supposed to be the Python unicode data-type with #
representing the number of characters. If this data-type is handed off
to a Python that is compiled with a different representation for
Unicode, then we have a problem.
Right now, the typestring value gives the number of bytes in the type.
Thus, "U4" gives dtype("<U8") on my system where sizeof(Py_UNICODE)==2,
but on another system it could give dtype("<U16").
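For concreteness, here is a quick sketch of what this looks like from
the interpreter (a sketch of the behavior described above; the printed
values depend on how the interpreter was built, and later NumPy
releases changed this so the number always counts characters):

import numpy as np

# 'U4' asks for four Unicode characters; the itemsize in bytes then
# depends on sizeof(Py_UNICODE) in the interpreter that built it.
dt = np.dtype('U4')
print(dt.itemsize)   # 8 on a 2-byte Py_UNICODE build, 16 on a 4-byte build
print(dt.str)        # '<U8' or '<U16' respectively, per the text above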
I know only a little bit about Unicode. A full Unicode character is a
4-byte entity, but there are standard variable-width encodings built on
2-byte (UTF-16) and even 1-byte (UTF-8) code units.
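To illustrate the different widths, here is a small sketch encoding a
single character with Python's standard codecs (codec names spelled as
Python spells them):

# One code point, three different byte counts under the standard
# encodings; UTF-8 and UTF-16 are variable-width.
u = u'\u20ac'                       # EURO SIGN, a single character
print(len(u.encode('utf-8')))       # 3 bytes
print(len(u.encode('utf-16-be')))   # 2 bytes
print(len(u.encode('utf-32-be')))   # 4 bytes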
I changed the source so that "<U8" gets interpreted the same as "U4"
(i.e., if you specify an endianness then you are being byte-conscious
anyway, so the number is interpreted as a byte count; otherwise the
number is interpreted as a character count). This fixes issues on the
same platform, but does not fix issues where data is saved out with one
Python interpreter and read in by another with a different value of
sizeof(Py_UNICODE).
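To make the rule explicit, here is a hypothetical helper (not a NumPy
API, just a sketch of the interpretation described above), assuming
sizeof(Py_UNICODE) == 2:

import re

def unicode_itemsize(typestr, sizeof_py_unicode=2):
    # Sketch of the rule above: with a byte-order character the
    # number is a byte count; without one it is a character count.
    m = re.match(r'^([<>=|]?)U([0-9]+)$', typestr)
    if m is None:
        raise ValueError('not a unicode typestring: %r' % (typestr,))
    byteorder, n = m.group(1), int(m.group(2))
    if byteorder:
        return n                      # '<U8' -> 8 bytes
    return n * sizeof_py_unicode      # 'U4'  -> 8 bytes

print(unicode_itemsize('<U8'))   # 8
print(unicode_itemsize('U4'))    # 8  (same storage as '<U8')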
-Travis