[Numpy-discussion] newunicode branch started to fix unicode to always be UCS4
oliphant.travis at ieee.org
Wed Feb 8 21:37:03 EST 2006
Travis Oliphant wrote:
> I've started a branch on SVN to fix the unicode implementation in
> NumPy so that internally all unicode arrays use UCS4. When a scalar
> is obtained it will be the Python unicode scalar and the required
> conversions (and data-copying) will be done.
> If anybody would like to help the branch is
Well, it turned out not to be too difficult. It is done. All Unicode
arrays now always use 4 bytes per character in NumPy. The length is
specified in terms of characters (not bytes). This is different from
other types, but it's consistent with the use of Unicode as characters.
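
As a rough illustration of the character-based length, here is a small sketch. It assumes a current NumPy release on Python 3 (where the Python unicode type is `str`), not the 2006 branch itself; the variable names are only for illustration.

```python
import numpy as np

# "U5" requests a unicode field of 5 characters, not 5 bytes.
unicode_dtype = np.dtype("U5")
unicode_itemsize = unicode_dtype.itemsize      # storage in bytes: 5 chars * 4 bytes

# For comparison, "S5" (byte strings) is sized directly in bytes.
bytes_itemsize = np.dtype("S5").itemsize

# An array built from strings is sized to its longest element (3 characters).
arr = np.array(["abc", "de"])
bytes_per_char = arr.dtype.itemsize // 3

# The scalar pulled out of the array is a subclass of the Python string type.
scalar = arr[0]
scalar_is_python_str = isinstance(scalar, str)
```

The point of the comparison with `S5` is that the number in the unicode type code counts characters, so the byte size has to be derived by multiplying by 4.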
The array scalar that a unicode array produces inherits directly from
the Python unicode type, which uses either 2 or 4 bytes per character
depending on the build. On narrow builds, where Python unicode uses only
2 bytes per character, the 4-byte unicode is converted to 2-byte form
using surrogate pairs.
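
For readers unfamiliar with surrogate pairs, the conversion can be sketched as below. This is the standard UTF-16 arithmetic written in plain Python, not the code from the branch; the function name `ucs4_to_utf16_units` is hypothetical.

```python
def ucs4_to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for one UCS4 code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x10000:
        # Fits in a single 2-byte unit; no surrogate pair needed.
        return [code_point]
    # Split the 20 bits above 0x10000 into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # low (trailing) surrogate
    return [high, low]


# U+1D11E lies outside the 2-byte range, so it needs a pair.
pair = ucs4_to_utf16_units(0x1D11E)
# A character such as U+00E9 stays a single unit.
single = ucs4_to_utf16_units(0x00E9)
```

One consequence is that a single 4-byte array element can turn into two 2-byte units in the scalar on a narrow build, so the scalar's length in units may exceed the character count.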
There may be lingering bugs of course, so please try it out and report