[Numpy-discussion] newunicode branch started to fix unicode to always be UCS4

Thu Feb 9 04:50:03 EST 2006

On Thursday 09 February 2006 06:36, Travis Oliphant wrote:
> Travis Oliphant wrote:
> > I've started a branch on SVN to fix the unicode implementation in
> > NumPy so that internally all unicode arrays use UCS4.  When a scalar
> > is obtained it will be the Python unicode scalar and the required
> > conversions (and data-copying) will be done.
> > If anybody would like to help the branch is
>
> Well, it turned out not to be too difficult. It is done.

Oh my! If I hadn't met you in person, I would tend to think that
you are not human ;-)

> All Unicode
> arrays are now always 4-bytes-per character in NumPy.   The length is
> specified in terms of characters (not bytes).  This is different than
> other types, but it's consistent with the use of Unicode as characters.

Yes, I think this is a good idea.

> The array-scalar that a unicode array produces inherits directly from
> Python unicode type which has either 2 or 4 bytes depending on the build.
>
> On narrow builds where Python unicode is only 2-bytes, the 4-byte
> unicode is converted to 2-byte using surrogate pairs.

Very good!
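
For readers unfamiliar with surrogate pairs, here is a minimal sketch of
the arithmetic involved in going from a 4-byte code point to two 2-byte
units and back. The function names are hypothetical and purely
illustrative; this is not the code in the branch.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)     # MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))                 # 0xd834 0xdd1e
```

Characters at or below U+FFFF need no pair at all, so for text like the
examples below each 4-byte item simply maps to one 2-byte unit.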

> There may be lingering bugs of course, so please try it out and report
> problems.

Well, I've tried it for a while and it seems to me that you did a
very good job! Just one little thing:

# Using a UCS4 interpreter here
>>> len(buffer(numpy.array("qsds", 'U4')[()]))
16
>>> numpy.array("qsds", 'U4')[()].dtype
dtype('<U4')
>>> len(buffer(numpy.array("qsds", 'U3')[()]))
12
>>> numpy.array("qsds", 'U3')[()].dtype
dtype('<U3')

So far so good. But in UCS2 we have:

# Using a UCS2 interpreter here
>>> len(buffer(numpy.array("qsds", 'U4')[()]))
8   # Fine
>>> numpy.array("qsds", 'U4')[()].dtype
dtype('<U2')   # Shouldn't be U4?
>>> len(buffer(numpy.array("qsds", 'U3')[()]))
6   # Fine
>>> numpy.array("qsds", 'U3')[()].dtype
dtype('<U1')   # Shouldn't be U3?
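
As a cross-check on the buffer lengths (not on NumPy itself), the same
sizes can be reproduced with plain Python codecs, assuming UCS4 behaves
like UTF-32 and UCS2 like UTF-16 for this ASCII text. The variable names
are just for illustration.

```python
text = "qsds"

# 4 bytes per character, as on a UCS4 interpreter (no BOM with "-le")
print(len(text.encode("utf-32-le")))       # 16
print(len(text[:3].encode("utf-32-le")))   # 12

# 2 bytes per character, as on a UCS2 interpreter
ucs2_len_4 = len(text.encode("utf-16-le"))
ucs2_len_3 = len(text[:3].encode("utf-16-le"))
print(ucs2_len_4, ucs2_len_3)              # 8 6

# Dividing the 2-byte sizes by 4 gives the lengths reported above
print(ucs2_len_4 // 4, ucs2_len_3 // 4)    # 2 1
```

The last line is only an observation: the U2 and U1 results are at least
consistent with the scalar's byte size being divided by 4 instead of 2.
I have not checked whether that is what actually happens in the branch.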

I'll try to do more serious testing and contribute the results back as
a set of unit tests.

One final consideration. A FAQ about Unicode
(http://www.cl.cam.ac.uk/~mgk25/unicode.html) says:

"""
No endianess is implied by the encoding names UCS-2, UCS-4, UTF-16,
and UTF-32, though ISO 10646-1 says that Bigendian should be preferred
unless otherwise agreed. It has become customary to append the letters
"BE" (Bigendian, high-byte first) and "LE" (Littleendian, low-byte
first) to the encoding names in order to explicitly specify a byte
order.
"""

In NumPy, it seems that the endianness is the same as the platform's,
while the ISO recommendation says that big-endian should be preferred.
I don't know what the convention is in Python about this, but in any
case, I'd follow the Python convention, not the ISO one.
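
For what it's worth, here is a small sketch of how Python's own codecs
treat byte order, using only the standard "utf-32" codec names. It says
nothing about what NumPy does internally; it only shows that the
unsuffixed codec uses the platform byte order and marks it with a BOM.

```python
import sys

ch = "a"                            # U+0061

le = ch.encode("utf-32-le")         # explicit little-endian, no BOM
be = ch.encode("utf-32-be")         # explicit big-endian, no BOM
native = ch.encode("utf-32")        # 4-byte BOM, then platform byte order

print(sys.byteorder)
print(le)                           # b'a\x00\x00\x00'
print(be)                           # b'\x00\x00\x00a'
print(native[4:] == (le if sys.byteorder == "little" else be))  # True
```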

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"