[Numpy-discussion] Extent of unicode types in numpy
Travis Oliphant
oliphant.travis at ieee.org
Wed Feb 8 00:42:03 EST 2006
Francesc Altet wrote:
>Ok. I see that you got my point. Well, maybe I'm wrong here, but my
>proposal would result in implementing just one new data-type for 32-bit
>unicode when the python platform is UCS2 aware. If, as you said above,
>Py_UCS4 type is always defined, even on UCS2 interpreters, that should
>be relatively easy to do.
>
Hmm. I think I'm beginning to like your idea. We could in fact make
the NumPy Unicode type always UCS4 and then keep the Python Unicode
scalar. On Python UCS2 builds the conversion would use UTF-16 to go to
the Python scalar (which would always inherit from the native unicode
type).
It would be one data-type where there was not an identical match in the
memory layout of the scalar and the array data-type, but because in this
case there are conversions to go back and forth, it may not matter.
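A minimal sketch of the kind of round trip meant here, written in Python rather than the C of arraytypes.inc.src. It assumes a fixed-width, zero-padded UCS4 buffer; the names ITEM_CHARS, setitem and getitem are hypothetical and are not NumPy's actual functions. The array side always stores 4 bytes per code point, and the scalar side is whatever the native Python unicode type is.

```python
ITEM_CHARS = 4               # hypothetical: code points per array element
ITEM_BYTES = ITEM_CHARS * 4  # UCS4 stores every code point in 4 bytes


def setitem(buf, index, text):
    """Write `text` into slot `index` as zero-padded little-endian UCS4."""
    raw = text.encode("utf-32-le")[:ITEM_BYTES]  # truncate to the item width
    raw = raw.ljust(ITEM_BYTES, b"\x00")         # pad short strings with NULs
    start = index * ITEM_BYTES
    buf[start:start + ITEM_BYTES] = raw


def getitem(buf, index):
    """Read slot `index` back as a native Python string, dropping padding."""
    start = index * ITEM_BYTES
    raw = bytes(buf[start:start + ITEM_BYTES])
    return raw.decode("utf-32-le").rstrip("\x00")


buf = bytearray(2 * ITEM_BYTES)      # room for two elements
setitem(buf, 0, "ab")
setitem(buf, 1, "\U0001D11E")        # a code point outside the BMP
```

On a UCS2 build, the decode step in getitem is where the UCS4 to UTF-16 conversion would take place, so the scalar and the stored item would differ in memory layout while still holding the same text.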
This would not be too difficult to implement, actually --- it would
require new functions to handle conversions in arraytypes.inc.src and
some modifications to PyArray_Scalar. The only drawbacks are that all
unicode arrays are now twice as large on Python UCS2 builds, and that there
is the aforementioned asymmetry between the data-type and the array-scalar
on those builds.
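To put a number on the size cost, here is a small illustration using Python's standard codecs. It assumes text restricted to the Basic Multilingual Plane, where a UCS2 layout needs 2 bytes per character and UCS4 needs 4; the sample string is arbitrary.

```python
text = "numpy"  # arbitrary BMP-only sample text

# The "-le" codecs write no byte-order mark, so the lengths are pure payload.
ucs2_bytes = len(text.encode("utf-16-le"))  # 2 bytes per BMP character
ucs4_bytes = len(text.encode("utf-32-le"))  # 4 bytes per character
```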
But, all in all, it sounds like a good plan. If somebody later wants to
add a reduced-size UCS2 array of unicode characters, we can cross that
bridge when we come to it.
I still like using explicit typecode characters in the array interface
to denote UCS2 or the UCS4 data-type. We could still change from 'W',
'w' to other characters...
>Well, probably I've overlooked something, but I really think that this
>would be a nice thing to do.
>
>
There are details in the scalar-array conversions (getitem and setitem)
that would have to be implemented, but it is possible. The UCS4 -->
UTF-16 encoding is one of the easiest. It's done in unicodeobject.h in
Python, but I'm not sure it's exposed other than going through the
interpreter.
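For reference, a sketch of the UCS4 --> UTF-16 step in Python. It follows the standard surrogate-pair rule from the Unicode standard; the function name ucs4_to_utf16_units is hypothetical and this is not the code in the Python sources. Input validation (lone surrogates, values above 0x10FFFF) is left out for brevity.

```python
def ucs4_to_utf16_units(code_points):
    """Convert a sequence of UCS4 code points to UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            # BMP code points map to a single 16-bit unit unchanged.
            units.append(cp)
        else:
            # Others are split into a high and a low surrogate,
            # each carrying 10 bits of (cp - 0x10000).
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))
            units.append(0xDC00 | (v & 0x3FF))
    return units


# 'A' stays one unit; U+1D11E becomes a surrogate pair.
units = ucs4_to_utf16_units([0x41, 0x1D11E])
```

The reverse direction (UTF-16 to UCS4) recombines each high/low pair, which is what setitem would need on a UCS2 build.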
Does this seem like a solution that everyone can live with?
-Travis