[Numpy-discussion] Unicode revisited

Sat Aug 4 06:42:09 EDT 2012

Travis Oliphant <travis at continuum.io> wrote:
> The difficulty starts when you start to interact with the unicode array scalar (which is the same data-structure exactly as a Python unicode object with a different type-name --- numpy.unicode_).    However, I overlooked the "encoding" argument to the standard "unicode" constructor which might have simplified what we are doing.    If I understand things correctly, now, all we need to do is to "decode" the UTF-32LE or UTF-32BE raw bytes in the array (depending on the dtype) into a unicode object. 
> 
> This is easily accomplished with  numpy.unicode_(<bytes object>, 'utf_32_be'  or 'utf_32_le').    There is also an "encoding" equivalent to go from the Python unicode object to the bytes representation in the NumPy array.   I think this is what we should be doing in most of the places and it should considerably simplify the Unicode code in NumPy --- eliminating possibly the ucsnarrow.c file. 

That sounds right to me. At the C level, in PyArray_Scalar(), this should work for
all Python versions >= 2.6, provided that the data is aligned in the case of a narrow
build:

    /* data is assumed to be aligned */
    if (type_num == NPY_UNICODE) {
        PyObject *u;
        PyObject *args;
        int byteorder;

        /* -1: little endian, 1: big endian, 0: native order */
        switch (descr->byteorder) {
        case '<':
            byteorder = -1;
            break;
        case '>':
            byteorder = 1;
            break;
        default: /* '=', '|' */
            byteorder = 0;
            break;
        }

        /* function exists since 2.6 */
        u = PyUnicode_DecodeUTF32(data, itemsize, NULL, &byteorder);
        if (u == NULL) {
            return NULL;
        }

        /* "N" hands our reference to u over to the argument tuple */
        args = Py_BuildValue("(N)", u);
        if (args == NULL) {
            return NULL;
        }

        u = type->tp_new(type, args, NULL);
        Py_DECREF(args);
        return u;
    }
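
The byte-order mapping in the switch above can be checked in isolation, without
linking against Python. The following is only a minimal sketch: the helper names
utf32_byteorder_flag() and read_utf32_unit() are hypothetical and not part of
NumPy or the Python C API. The first mirrors the flag convention of
PyUnicode_DecodeUTF32 (-1 little endian, 1 big endian, 0 native order); the
second reads one UTF-32 code unit byte by byte, which is one way to avoid the
alignment requirement mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map a dtype byte-order character to the flag
 * convention of PyUnicode_DecodeUTF32 (-1 little, 1 big, 0 native). */
static int utf32_byteorder_flag(char dtype_byteorder)
{
    switch (dtype_byteorder) {
    case '<':
        return -1;
    case '>':
        return 1;
    default: /* '=', '|' */
        return 0;
    }
}

/* Hypothetical helper: read one UTF-32 code unit from four bytes.
 * Reading byte by byte means p does not have to be 4-byte aligned. */
static uint32_t read_utf32_unit(const unsigned char *p, int byteorder)
{
    assert(p != NULL);

    if (byteorder == 0) {
        /* Resolve "native" to the byte order of the running machine. */
        const uint16_t probe = 1;
        byteorder = (*(const unsigned char *)&probe == 1) ? -1 : 1;
    }

    if (byteorder < 0) {
        /* little endian: least significant byte first */
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }
    /* big endian: most significant byte first */
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

With the bytes AC 20 00 00 and a flag of -1, or the bytes 00 00 20 AC and a flag
of 1, read_utf32_unit() yields the same code point, U+20AC. Without the break
statements every case would fall through to the default and the flag would
always end up as 0.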

All newbyteorder() tests have to be deleted, of course. I also think that
ucsnarrow.c is no longer needed.

Stefan Krah