[Numpy-discussion] Extent of unicode types in numpy

Travis Oliphant oliphant.travis at ieee.org
Wed Feb 8 00:42:03 EST 2006


Francesc Altet wrote:

>Ok. I see that you got my point. Well, maybe I'm wrong here, but my
>proposal would result in implementing just one new data-type for 32-bit
>unicode when the Python platform is UCS2 aware. If, as you said above,
>the Py_UCS4 type is always defined, even on UCS2 interpreters, that should
>be relatively easy to do. 
>
Hmm.  I think I'm beginning to like your idea.  We could in fact make 
the NumPy Unicode type always UCS4 and keep the Python Unicode scalar.  
On Python UCS2 builds the conversion to the Python scalar (which would 
always inherit from the native unicode type) would go through UTF-16.

It would be the one data-type where the memory layout of the scalar does 
not match that of the array data-type exactly, but because conversions 
exist in both directions, that may not matter much.

This would not be too difficult to implement, actually: it would 
require new conversion functions in arraytypes.inc.src and 
some modifications to PyArray_Scalar.  The only drawbacks are that 
all unicode arrays would be twice as large as they would be with UCS2 
storage, and the aforementioned asymmetry between the data-type and the 
array-scalar on Python UCS2 builds.
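
To make that concrete, here is a rough sketch of the getitem-style 
conversion from a fixed-width UCS4 element to the Python unicode scalar.  
This is not the actual arraytypes.inc.src code; the function name and 
signature are made up for illustration, and it assumes Py_UCS4 storage 
with a UTF-16 expansion on UCS2 builds:

#include <Python.h>

/* Illustrative only: turn one fixed-width UCS4 array element into a
 * Python unicode scalar.  "buf" points at the element and "nchars" is
 * the number of 4-byte characters it holds. */
static PyObject *
ucs4_element_to_scalar(const Py_UCS4 *buf, int nchars)
{
#ifdef Py_UNICODE_WIDE
    /* Python itself is a UCS4 build: Py_UNICODE matches the storage. */
    return PyUnicode_FromUnicode((const Py_UNICODE *)buf, nchars);
#else
    /* Python is a UCS2 build: expand each code point to one UTF-16
     * unit, or to a surrogate pair if it lies outside the BMP. */
    Py_UNICODE *tmp;
    PyObject *res;
    int i, j = 0;

    tmp = PyMem_New(Py_UNICODE, 2 * nchars);
    if (tmp == NULL) {
        return PyErr_NoMemory();
    }
    for (i = 0; i < nchars; i++) {
        Py_UCS4 c = buf[i];
        if (c > 0xFFFF) {
            c -= 0x10000;
            tmp[j++] = (Py_UNICODE)(0xD800 + (c >> 10));
            tmp[j++] = (Py_UNICODE)(0xDC00 + (c & 0x3FF));
        }
        else {
            tmp[j++] = (Py_UNICODE)c;
        }
    }
    res = PyUnicode_FromUnicode(tmp, j);
    PyMem_Free(tmp);
    return res;
#endif
}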

But, all in all, it sounds like a good plan. If the time comes that 
somebody wants to add a reduced-size UCS2 array of unicode characters, 
we can cross that bridge then.

I still like using explicit typecode characters in the array interface 
to denote the UCS2 and UCS4 data-types.  We could still change from 'W' 
and 'w' to other characters...
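
Just to illustrate what I mean (the actual characters are still up for 
debate, so treat 'w' and 'W' as placeholders), the typestring handling 
would only need something like:

/* Illustrative placeholder: map a proposed typecode character to the
 * number of bytes per character.  Returns -1 for anything else. */
static int
unicode_typechar_itemsize(char typechar)
{
    switch (typechar) {
        case 'w': return 2;   /* UCS2 */
        case 'W': return 4;   /* UCS4 */
        default:  return -1;  /* not a unicode typecode */
    }
}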

>Well, probably I've overlooked something, but I really think that this
>would be a nice thing to do.
>  
>
There are details in the scalar-array conversions (getitem and setitem) 
that would have to be implemented, but it is possible.  The UCS4 --> 
UTF-16 encoding is one of the easiest.  It's done in unicodeobject.h in 
Python, but I'm not sure it's exposed other than by going through the 
interpreter.
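
The setitem direction is not much harder either.  A rough sketch (again 
with made-up names, just to show the idea) of copying a scalar back into 
a fixed-width element, combining surrogate pairs on UCS2 builds:

#include <Python.h>

/* Illustrative only: copy a Python unicode scalar into a fixed-width
 * UCS4 array element, NUL-padding the remainder. */
static void
scalar_to_ucs4_element(PyObject *uni, Py_UCS4 *buf, int nchars)
{
    Py_UNICODE *u = PyUnicode_AS_UNICODE(uni);
    Py_ssize_t len = PyUnicode_GET_SIZE(uni);
    Py_ssize_t i;
    int j = 0;

    for (i = 0; i < len && j < nchars; i++) {
        Py_UCS4 c = (Py_UCS4)u[i];
#ifndef Py_UNICODE_WIDE
        /* UCS2 build: combine a high/low surrogate pair into one
         * code point outside the BMP. */
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < len
                && u[i + 1] >= 0xDC00 && u[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (Py_UCS4)(u[i + 1] - 0xDC00);
            i++;
        }
#endif
        buf[j++] = c;
    }
    while (j < nchars) {
        buf[j++] = 0;   /* pad the fixed-width element with NULs */
    }
}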

Does this seem like a solution that everyone can live with?

-Travis




