[Python-Dev] Help with Unicode arrays in NumPy
Travis E. Oliphant
oliphant.travis at ieee.org
Tue Feb 7 21:23:13 CET 2006
Martin v. Löwis wrote:
> Travis E. Oliphant wrote:
>>Currently that means that they are "unicode" strings of basic size UCS2
>>or UCS4 depending on the platform. It is this duality that has some
>>people concerned. For all other data-types, NumPy allows the user to
>>explicitly request a bit-width for the data-type.
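The explicit bit-width request works the same way for string data as for numbers; a minimal sketch (note that modern NumPy resolved the duality by always using 4 bytes per character for 'U' dtypes):

```python
import numpy as np

# For numeric types, the element size is part of the dtype itself:
assert np.dtype(np.int32).itemsize == 4
assert np.dtype(np.int64).itemsize == 8

# Unicode dtypes are fixed-length per element; 'U10' holds 10 code points.
# The per-character width (UCS2 vs. UCS4 in the NumPy of this era) was
# platform-dependent, which is the duality being discussed here.
u = np.dtype('U10')
assert u.itemsize == 10 * 4  # modern NumPy always stores UCS4
```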
> Why is that a desirable property? Also: Why does NumPy have support for
> Unicode arrays in the first place?
NumPy supports arrays of arbitrary fixed-length "records". It handles much
more than numeric-only data now. One of the fields that a record can
contain is a string. If strings are supported, it makes sense to
support unicode strings as well.
This allows NumPy to memory-map arbitrary data-files on disk.
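A record dtype maps a fixed on-disk layout directly onto array fields. The field names and byte layout below are hypothetical; `np.frombuffer` is used in place of `np.memmap` so the sketch needs no file:

```python
import numpy as np

# Hypothetical record layout: a 4-byte name field plus a little-endian int32.
rec = np.dtype([('name', 'S4'), ('value', '<i4')])

# np.memmap would map a file with this layout directly; frombuffer
# demonstrates the same fixed-layout interpretation on an in-memory buffer.
raw = b'spam\x2a\x00\x00\x00ham \x07\x00\x00\x00'
arr = np.frombuffer(raw, dtype=rec)
assert arr['name'][0] == b'spam'
assert arr['value'].tolist() == [42, 7]
```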
Perhaps you should explain why you think NumPy "shouldn't support Unicode".
> My initial reaction is: use whatever Python uses in "NumPy Unicode".
> Upon closer inspection, it is not all that clear what operations
> are supported on a Unicode array, and how these operations relate
> to the Python Unicode type.
That is currently what is done. The current unicode data-type is
exactly what Python uses.
The chararray subclass gives unicode and string arrays all the methods
of unicode and string objects (operating on an element-by-element basis).
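In modern NumPy the same element-by-element string behaviour is exposed through `np.char` (the functional spelling of what chararray provides); a small sketch:

```python
import numpy as np

# np.char applies ordinary string methods to every element of the array.
a = np.array(['spam', 'eggs'])
assert np.char.upper(a).tolist() == ['SPAM', 'EGGS']

# The same works on unicode arrays, including non-ASCII characters.
u = np.array(['héllo'], dtype='U5')
assert np.char.capitalize(u).tolist() == ['Héllo']
```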
When you extract an element from the unicode data-type you get a Python
unicode object (every NumPy data-type has a corresponding "type-object"
that determines what is returned when an element is extracted). All of
these types are in a hierarchy of data-types which inherit from the
basic Python types when available.
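The inheritance from basic Python types can be seen directly: in current NumPy the unicode scalar type `np.str_` subclasses the Python string type, so an extracted element behaves exactly like an ordinary string:

```python
import numpy as np

a = np.array(['abc'], dtype='U3')
elem = a[0]

# The NumPy scalar type inherits from the Python str type, so the
# extracted element is usable anywhere a Python string is expected.
assert type(elem) is np.str_
assert isinstance(elem, str)
assert elem == 'abc'
```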
> In any case, I think NumPy should have only a single "Unicode array"
> type (please do explain why having zero of them is insufficient).
Please explain why having zero of them is *sufficient*.
> If the purpose of the type is to interoperate with a Python
> unicode object, it should use the same width (as this will
> allow for mempcy).
> If the purpose is to support arbitrary Unicode characters, it should
> use 4 bytes (as two bytes are insufficient to represent arbitrary
> Unicode characters).
And Python does not support arbitrary Unicode characters on narrow
builds? Then how is \U0010FFFF represented?
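The answer is that a narrow build stored it as a UTF-16 surrogate pair. A sketch of the encoding (run here on a modern wide build, where the string is a single code point):

```python
# U+10FFFF lies outside the Basic Multilingual Plane; in UTF-16, the
# internal representation of a narrow build, it becomes a surrogate pair.
c = '\U0010FFFF'
assert c.encode('utf-16-be') == b'\xdb\xff\xdf\xff'  # high DBFF, low DFFF

# On a modern (wide) Python build the string is a single code point:
assert len(c) == 1
```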
> If the purpose is something else, please explain what the purpose
The purpose is to represent bytes as they might exist in a file or
data-stream according to the user's specification. The purpose is
whatever the user wants them for. It's the same purpose as having an
unsigned 64-bit data-type --- because users may need it to represent
data as it exists in a file.
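The unsigned 64-bit case makes the point concrete: such a value can sit in a file even though it exceeds the signed 64-bit range, so the dtype exists to read it back faithfully. A minimal sketch:

```python
import numpy as np

# A little-endian unsigned 64-bit value as it might sit in a binary file.
raw = (2**63).to_bytes(8, 'little')

# '<u8' reinterprets those bytes exactly; the value is too large for a
# signed 64-bit integer, which is why the unsigned dtype is needed.
val = np.frombuffer(raw, dtype='<u8')[0]
assert val == 2**63
```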