[Python-Dev] Help with Unicode arrays in NumPy

Wed Feb 8 08:19:30 CET 2006

>>>>> "Travis" == Travis E Oliphant <oliphant.travis at ieee.org> writes:

    Travis> Numpy supports arrays of arbitrary fixed-length "records".
    Travis> It is much more than numeric-only data now.  One of the
    Travis> fields that a record can contain is a string.  If strings
    Travis> are supported, it makes sense to support unicode strings
    Travis> as well.

That is not obvious.  A string is really an array of bytes, which for
historical reasons in some places (primarily the U.S. of A.) can be
used to represent text.  Unicode, on the other hand, is intended to
represent text streams robustly and does so in a universal but
flexible way ... but all of the different Unicode transformation
formats are considered to represent the *identical* text stream.  Some
applications may specify a transformation format, others will not.

In any case, internally Python is only going to support *one*; all the
others must be read in through codecs anyway.  See below.

    Travis> This allows NumPy to memory-map arbitrary data-files on
    Travis> disk.

In the case where a transformation format *is* specified, I don't see
why you can't use a byte array field (ie, ordinary "string") of
appropriate size for this purpose, and read it through a codec when it
needs to be treated as text.  This is going to be necessary in
essentially all of the cases I encounter, because the files are UTF-8
and sane internal representations are either UTF-16 or UTF-32.  In
particular, Python's internal representation is 16 or 32 bits wide.

    Travis> Perhaps you should explain why you think NumPy "shouldn't
    Travis> support Unicode"

Because it can't, not in the way you would like to, if I understand
you correctly.  Python chooses *one* of the many standard
representations for internal use, and because of the way the standard
is specified, it doesn't matter which one!  And none of the others can
be represented directly, all must be decoded for internal use and
encoded when written back to external media.  So any memory mapping
application is inherently nonportable, even across Python
implementations.

    Travis> And Python does not support arbitrary Unicode characters
    Travis> on narrow builds?  Then how is \U0010FFFF represented?

In a way incompatible with the concept of character array.  Now what
do you do?

The point is that Unicode is intentionally designed in such a way that
a plethora of representations is possible, but all are easily and
reliably interconverted.  Implementations are then free to choose an
appropriate internal representation, knowing that conversion from
external representations is "cheap" and standardized.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.