[Python-Dev] Help with Unicode arrays in NumPy

Tue Feb 7 21:23:13 CET 2006

Martin v. Löwis wrote:
> Travis E. Oliphant wrote:
> 
>>Currently that means that they are "unicode" strings of basic size UCS2 
>>or UCS4 depending on the platform.  It is this duality that has some 
>>people concerned.  For all other data-types, NumPy allows the user to 
>>explicitly request a bit-width for the data-type.
> 
> 
> Why is that a desirable property? Also: Why does NumPy have support for
> Unicode arrays in the first place?
> 

NumPy supports arrays of arbitrary fixed-length "records".  It is much 
more than numeric-only data now.  One of the fields that a record can 
contain is a string.  If strings are supported, it makes sense to 
support unicode strings as well.

This allows NumPy to memory-map arbitrary data-files on disk.
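
To make the record idea concrete, here is a minimal sketch, assuming a 
present-day NumPy (where the unicode data-type is always 4 bytes per 
character, unlike the platform-dependent width discussed in this thread).  
The field names and the file name are hypothetical, chosen only for 
illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: a 32-bit integer, an 8-byte string field,
# and a 4-character unicode field (4 bytes per character in current NumPy).
record = np.dtype([("id", "<i4"), ("tag", "S8"), ("label", "<U4")])

data = np.zeros(2, dtype=record)
data[0] = (1, b"alpha", "αβγ")
data[1] = (2, b"beta", "δ")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")  # hypothetical file name
    data.tofile(path)                        # raw fixed-length records on disk

    # Memory-map the file; the record count is inferred from the file size.
    mapped = np.memmap(path, dtype=record, mode="r")
    n_records = mapped.shape[0]
    first_label = str(mapped["label"][0])
    second_tag = bytes(mapped["tag"][1])
    del mapped                               # release the mapping before cleanup
```

Each record occupies 4 + 8 + 16 = 28 bytes because the default layout is 
packed, so the file can be described entirely by the data-type.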

Perhaps you should explain why you think NumPy "shouldn't support Unicode".

> 
> My initial reaction is: use whatever Python uses in "NumPy Unicode".
> Upon closer inspection, it is not all that clear what operations
> are supported on a Unicode array, and how these operations relate
> to the Python Unicode type.

That is currently what is done.  The current unicode data-type is 
exactly what Python uses.

The chararray subclass gives unicode and string arrays all the methods 
of unicode and string objects (applied on an element-by-element basis).
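
As a rough illustration of element-by-element string methods, here is a 
sketch that assumes a present-day NumPy, where the same operations are 
exposed as functions in the `np.char` module.  The sample arrays are made 
up for the example.

```python
import numpy as np

words = np.array(["numpy", "unicode"])   # unicode array
raw = np.array([b"numpy", b"bytes"])     # byte-string array

# Each function applies the corresponding string method to every element.
upper_words = np.char.upper(words)
upper_raw = np.char.upper(raw)
lengths = np.char.str_len(words)
```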

When you extract an element from the unicode data-type you get a Python 
unicode object (every NumPy data-type has a corresponding "type-object" 
that determines what is returned when an element is extracted).  All of 
these types are in a hierarchy of data-types which inherit from the 
basic Python types when available.
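
A small sketch of that hierarchy, assuming a present-day NumPy on Python 3 
(where the unicode scalar type is named `np.str_` and inherits from `str`):

```python
import numpy as np

arr = np.array(["abc", "de"])
item = arr[0]                      # extracting an element yields a NumPy scalar

is_numpy_scalar = isinstance(item, np.generic)
is_python_str = isinstance(item, str)           # also a plain Python string
in_char_branch = issubclass(np.str_, np.character)

# Other types follow the same pattern where a Python counterpart exists.
float_is_python_float = isinstance(np.float64(1.5), float)
```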

> 
> In any case, I think NumPy should have only a single "Unicode array"
> type (please do explain why having zero of them is insufficient).
> 

Please explain why having zero of them is *sufficient*.

> If the purpose of the type is to interoperate with a Python
> unicode object, it should use the same width (as this will
> allow for memcpy).
> 
> If the purpose is to support arbitrary Unicode characters, it should
> use 4 bytes (as two bytes are insufficient to represent arbitrary
> Unicode characters).

And Python does not support arbitrary Unicode characters on narrow 
builds?  Then how is \U0010FFFF represented?
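
For reference, a 2-byte representation can carry such a character as a 
UTF-16 surrogate pair.  The sketch below uses only the standard library on 
a current Python; the helper name `surrogate_pair` is hypothetical.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two UTF-16 code units."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x10FFFF)

# The codec produces the same two 16-bit units (little-endian here).
encoded = "\U0010FFFF".encode("utf-16-le")
units = (int.from_bytes(encoded[0:2], "little"),
         int.from_bytes(encoded[2:4], "little"))
```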

> 
> If the purpose is something else, please explain what the purpose
> is.

The purpose is to represent bytes as they might exist in a file or 
data-stream according to the user's specification.  The purpose is 
whatever the user wants them for.  It's the same purpose as having an 
unsigned 64-bit data-type --- because users may need it to represent 
data as it exists in a file.
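
As a sketch of why the width has to come from the user, the example below 
uses only the standard library.  The helper `decode_field` and the sample 
text are hypothetical; the point is that the same text occupies different 
bytes depending on the width that was used to write it.

```python
import struct

def decode_field(raw, width):
    """Decode a little-endian text field stored with 2 or 4 bytes per character."""
    codec = {2: "utf-16-le", 4: "utf-32-le"}[width]
    return raw.decode(codec)

text = "héllo"
two_byte = text.encode("utf-16-le")    # as a 2-byte-per-character writer stores it
four_byte = text.encode("utf-32-le")   # as a 4-byte-per-character writer stores it

from_two = decode_field(two_byte, 2)
from_four = decode_field(four_byte, 4)

# The unsigned 64-bit case is analogous: the reader must know the width.
packed = struct.pack("<Q", 2**64 - 1)
(value,) = struct.unpack("<Q", packed)
```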