[Numpy-discussion] Unicode revisited
Stefan Krah
stefan-usenet at bytereef.org
Sat Aug 4 06:42:09 EDT 2012
Travis Oliphant <travis at continuum.io> wrote:
> The difficulty starts when you start to interact with the unicode array scalar (which is the same data-structure exactly as a Python unicode object with a different type-name --- numpy.unicode_). However, I overlooked the "encoding" argument to the standard "unicode" constructor which might have simplified what we are doing. If I understand things correctly, now, all we need to do is to "decode" the UTF-32LE or UTF-32BE raw bytes in the array (depending on the dtype) into a unicode object.
>
> This is easily accomplished with numpy.unicode_(<bytes object>, 'utf_32_be' or 'utf_32_le'). There is also an "encoding" equivalent to go from the Python unicode object to the bytes representation in the NumPy array. I think this is what we should be doing in most of the places and it should considerably simplify the Unicode code in NumPy --- eliminating possibly the ucsnarrow.c file.
That sounds right to me. On the C-level for PyArray_Scalar() this should work for
all Python versions >= 2.6, provided that data is aligned in the case of a narrow
build:
    /* data is assumed to be aligned */
    if (type_num == NPY_UNICODE) {
        PyObject *u;
        PyObject *args;
        int byteorder;

        /* each case needs a break, otherwise control falls through
           to the default and byteorder is always 0 */
        switch (descr->byteorder) {
        case '<':
            byteorder = -1;   /* little endian */
            break;
        case '>':
            byteorder = 1;    /* big endian */
            break;
        default: /* '=', '|' */
            byteorder = 0;    /* native order */
            break;
        }

        /* function exists since 2.6 */
        u = PyUnicode_DecodeUTF32(data, itemsize, NULL, &byteorder);
        if (u == NULL) {
            return NULL;
        }

        /* "N" hands the reference to u over to the args tuple */
        args = Py_BuildValue("(N)", u);
        if (args == NULL) {
            return NULL;
        }

        u = type->tp_new(type, args, NULL);
        Py_DECREF(args);
        return u;
    }
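For comparison, the same decode step can be sketched at the Python level using only the standard codecs. This is a rough illustration, not NumPy code: decode_item is a hypothetical helper, and the assumption that an item is padded with trailing NUL code points which should be stripped is mine, not something stated above.

```python
import sys


def decode_item(raw, byteorder):
    """Decode one fixed-width UTF-32 item (hypothetical helper).

    byteorder uses the dtype-style characters '<', '>', '=' or '|'.
    """
    if byteorder == '<':
        codec = 'utf_32_le'
    elif byteorder == '>':
        codec = 'utf_32_be'
    else:
        # '=' and '|': fall back to the machine's native order
        codec = 'utf_32_le' if sys.byteorder == 'little' else 'utf_32_be'
    # Assumption: unused trailing slots hold NUL code points.
    return raw.decode(codec).rstrip('\x00')


text = 'h\u00e9llo'
# Simulate a 7-character item: 5 used slots plus 2 NUL-padded slots.
little = text.encode('utf_32_le') + b'\x00' * 8
big = text.encode('utf_32_be') + b'\x00' * 8

assert decode_item(little, '<') == text
assert decode_item(big, '>') == text
```

The explicit utf_32_le/utf_32_be codecs are used even for the native case so that no byte order mark is expected or consumed.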
All newbyteorder() tests have to be deleted, of course. I also think that
ucsnarrow.c is no longer needed.
Stefan Krah