[Numpy-discussion] Unicode revisited

Fri Aug 3 23:03:14 EDT 2012

On Fri, Aug 3, 2012 at 6:03 PM, Travis Oliphant <travis at continuum.io> wrote:
> Hey all,
>
> Ondrej has been working hard with feedback from many others on improving Unicode support in NumPy (especially for Python 3.3).   Looking at what Python has done in Python 3.3 (PEP 393) and chatting on the Python issue tracker with the author of that PEP has made me wonder if we aren't "doing the wrong thing" in NumPy quite often.
>
> Basically, NumPy only supports UTF-32 in its Unicode representation.   All bytes in NumPy arrays should be either UTF-32LE or UTF-32BE.    This is all pretty easy to understand as long as you stick with NumPy arrays only.
>
> The difficulty starts when you interact with the unicode array scalar (which is exactly the same data structure as a Python unicode object, with a different type name --- numpy.unicode_).    However, I overlooked the "encoding" argument to the standard "unicode" constructor, which might have simplified what we are doing.    If I understand things correctly now, all we need to do is "decode" the UTF-32LE or UTF-32BE raw bytes in the array (depending on the dtype) into a unicode object.
>
> This is easily accomplished with numpy.unicode_(<bytes object>, 'utf_32_be' or 'utf_32_le').    There is also an "encode" equivalent to go from the Python unicode object to the bytes representation in the NumPy array.   I think this is what we should be doing in most places, and it should considerably simplify the Unicode code in NumPy --- possibly eliminating the ucsnarrow.c file.
>
> Am I missing something?

I guess we'll try and see. :)
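
For reference, here is a minimal sketch of the decode/encode round trip
described above, written in plain Python 3 without NumPy. The helper names
decode_item and encode_item and the '<' / '>' byte-order flags are
assumptions made for illustration; they are not NumPy API, and this is not
the code in the pull request.

```python
def _codec(byteorder):
    # '<' means little-endian, anything else is treated as big-endian.
    # The explicit-endian codecs neither emit nor expect a BOM.
    return "utf_32_le" if byteorder == "<" else "utf_32_be"


def decode_item(raw, byteorder):
    """Decode one fixed-width UTF-32 item (raw bytes) into a str."""
    # Fixed-width items are assumed to be padded with NUL code points,
    # so the trailing padding is dropped after decoding.
    return raw.decode(_codec(byteorder)).rstrip("\x00")


def encode_item(text, byteorder, nchars):
    """Encode a str into a fixed-width UTF-32 item of nchars code points."""
    raw = text.encode(_codec(byteorder))
    width = 4 * nchars  # UTF-32 uses 4 bytes per code point
    if len(raw) > width:
        raise ValueError("text does not fit in %d code points" % nchars)
    # Pad with NUL bytes up to the fixed item width.
    return raw + b"\x00" * (width - len(raw))


# Round trip, including a code point outside the BMP.
raw = encode_item("a\U0001F600", "<", 4)
text = decode_item(raw, "<")
```

Because the length check and padding work on the encoded bytes rather
than on len(text), the sketch does not depend on whether the interpreter
is a narrow or wide Unicode build.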

Would it make sense to merge https://github.com/numpy/numpy/pull/372
now, because it will make NumPy work in Python 3.3 (and it seems to
me that the implementation is reasonable)? Then I'll work on
trying out your new approach for 2.7, 3.2 and 3.3.

Ondrej