[Numpy-discussion] newunicode branch started to fix unicode to always be UCS4

Wed Feb 8 21:37:03 EST 2006

Travis Oliphant wrote:

>
> I've started a branch on SVN to fix the unicode implementation in 
> NumPy so that internally all unicode arrays use UCS4.  When a scalar 
> is obtained it will be the Python unicode scalar and the required 
> conversions (and data-copying) will be done.
> If anybody would like to help the branch is
>
Well, it turned out not to be too difficult.  It is done.  All Unicode 
arrays in NumPy now always use 4 bytes per character.  The length is 
specified in characters (not bytes).  This is different from 
other types, but it's consistent with the use of Unicode as characters.
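
As a rough illustration of the character-based length, here is a minimal 
sketch. It assumes a current NumPy release on Python 3 (not the 2006 
branch itself); the array contents and variable names are made up for 
the example.

```python
import numpy as np

# A unicode array whose elements hold up to 3 characters each.
a = np.array(["abc", "de"], dtype="U3")

# The dtype length is given in characters, but storage is UCS4,
# so each element occupies 4 bytes per character.
itemsize = a.dtype.itemsize      # bytes per element
nchars = itemsize // 4           # characters per element

print(a.dtype.kind, nchars, itemsize)
```

The itemsize is the only place the 4-byte storage shows through; the 
dtype string itself counts characters.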

The array scalar that a unicode array produces inherits directly from 
the Python unicode type, which uses either 2 or 4 bytes per character, 
depending on the build.

On narrow builds, where Python unicode uses only 2 bytes per character, 
the 4-byte data is converted to 2-byte units, using surrogate pairs for 
characters outside the Basic Multilingual Plane.
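
For readers unfamiliar with surrogate pairs, the standard UTF-16 
arithmetic can be sketched as below. This is not the NumPy C code, only 
an illustration of the encoding rule; the function name 
to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper for illustration; it follows the standard
    UTF-16 encoding rule, not any particular NumPy routine.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

Code points at or below U+FFFF fit in a single 2-byte unit and need no 
such splitting.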

There may be lingering bugs of course, so please try it out and report 
problems.

-Travis