[Numpy-discussion] Extent of unicode types in numpy
Travis Oliphant
oliphant at ee.byu.edu
Mon Feb 6 14:16:02 EST 2006
Francesc Altet wrote:
>Hi,
>
>I'm a bit surprised by the fact that unicode types are the only ones
>that break the rule: they must be specified with a different number of
>bytes than they actually take. For example:
>
>
Right now, the array protocol typestring is a little ambiguous about
Unicode characters. Ideally, the array interface would describe what
kind of Unicode characters are being dealt with, so that 2-byte and
4-byte Unicode characters have different descriptions in the typestring.
Python can be compiled with Unicode as either 2-byte or 4-byte. The
'U#' descriptor is supposed to be the Python unicode data-type with #
representing the number of characters. If this data-type is handed off
to a Python that is compiled with a different representation for
Unicode, then we have a problem.
Right now, the typestring value gives the number of bytes in the type.
Thus, "U4" gives dtype("<U8") on my system where sizeof(Py_UNICODE)==2,
but on another system it could give dtype("<U16").
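For concreteness, here is a quick sketch of what this looks like from
the interpreter (a sketch of the behavior described above; the printed
values depend on how the interpreter was built, and later NumPy
releases changed this so the number always counts characters):

import numpy as np

# 'U4' asks for four Unicode characters; the itemsize in bytes then
# depends on sizeof(Py_UNICODE) in the interpreter that built it.
dt = np.dtype('U4')
print(dt.itemsize)   # 8 on a 2-byte Py_UNICODE build, 16 on a 4-byte build
print(dt.str)        # '<U8' or '<U16' respectively, per the text above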
I know only a little bit about Unicode. A full Unicode character is a
4-byte entity, but there are standard variable-width encodings built on
2-byte (UTF-16) and even 1-byte (UTF-8) code units.
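To illustrate the different widths, here is a small sketch encoding a
single character with Python's standard codecs (codec names spelled as
Python spells them):

# One code point, three different byte counts under the standard
# encodings; UTF-8 and UTF-16 are variable-width.
u = u'\u20ac'                       # EURO SIGN, a single character
print(len(u.encode('utf-8')))       # 3 bytes
print(len(u.encode('utf-16-be')))   # 2 bytes
print(len(u.encode('utf-32-be')))   # 4 bytes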
I changed the source so that "<U8" gets interpreted the same as "U4"
(i.e., if you specify an endianness then you are being byte-conscious
anyway, so the number is interpreted as a byte count; otherwise the
number is interpreted as a character count). This fixes issues on the
same platform, but does not fix issues where data is saved out with one
Python interpreter and read in by another with a different value of
sizeof(Py_UNICODE).
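To make the rule explicit, here is a hypothetical helper (not a NumPy
API, just a sketch of the interpretation described above), assuming
sizeof(Py_UNICODE) == 2:

import re

def unicode_itemsize(typestr, sizeof_py_unicode=2):
    # Sketch of the rule above: with a byte-order character the
    # number is a byte count; without one it is a character count.
    m = re.match(r'^([<>=|]?)U([0-9]+)$', typestr)
    if m is None:
        raise ValueError('not a unicode typestring: %r' % (typestr,))
    byteorder, n = m.group(1), int(m.group(2))
    if byteorder:
        return n                      # '<U8' -> 8 bytes
    return n * sizeof_py_unicode      # 'U4'  -> 8 bytes

print(unicode_itemsize('<U8'))   # 8
print(unicode_itemsize('U4'))    # 8  (same storage as '<U8')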
-Travis