[Numpy-discussion] PyArray_Scalar() and Unicode

Dan Roberts ademan555 at gmail.com
Sat Jun 12 20:33:13 EDT 2010


I apologize ahead of time for anything I might be totally missing, but in
order to make PyArray_Scalar() work on non-CPython interpreters, it's
necessary for me to significantly refactor that function.  I've made
(untested but correct-looking) changes to the function to handle all of the
data types except Unicode.  I just got the crash course in Unicode today, so
my understanding is limited.  It seems the most compatible way to turn the
UCS4 data into a PyUnicodeObject would be to first convert it to UCS2 and
then use PyUnicode_DecodeUTF16() to create the Python object.
    There are a few problems with this.  The biggest problem for me is that
it appears PyUCS2Buffer_FromUCS4() doesn't produce UCS2 at all, but rather
UTF-16, since it produces surrogate pairs for code points above 0xFFFF.  My
first question is: is there any time when the data produced by
PyUCS2Buffer_FromUCS4() wouldn't be parseable by a standards-compliant
UTF-16 decoder?  Aside from that, converting to UCS2, possibly after making
a word-aligned copy of the original data, and then converting that to the
native storage, which is likely UTF-16 anyway, is horribly wasteful.  The
ideal way to accomplish this would be to simply use PyUnicode_DecodeUTF32()
on the original data and be done with it.  The main drawback of this
approach is that it's not very compatible: it requires Python 2.6 and isn't
currently implemented in PyPy, though that's fixable.
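
To make the surrogate pair question concrete, here is a minimal sketch of the
UCS4 to UTF-16 conversion and the matching decode.  The names ucs4_to_utf16()
and utf16_to_ucs4() are hypothetical helpers written for illustration; this is
not the code in PyUCS2Buffer_FromUCS4(), and it assumes native byte order and a
caller-supplied output buffer.  As far as I can tell, a conversion of this shape
yields valid UTF-16 whenever every input value is a Unicode scalar value, that
is, at most 0x10FFFF and not itself in the surrogate range 0xD800-0xDFFF.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (NOT NumPy's PyUCS2Buffer_FromUCS4): encode n UCS4
 * code points from src as UTF-16 code units in dst.  dst must have room
 * for 2 * n units.  Returns the number of units written, or -1 if an
 * input value is not a valid Unicode scalar value.
 */
long ucs4_to_utf16(const uint32_t *src, size_t n, uint16_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = src[i];
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return -1;  /* out of range, or a lone surrogate */
        }
        if (cp < 0x10000) {
            dst[out++] = (uint16_t)cp;  /* BMP: one code unit */
        } else {
            cp -= 0x10000;              /* 20 bits remain */
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        }
    }
    return (long)out;
}

/*
 * Hypothetical strict decoder: turn n UTF-16 code units back into UCS4.
 * dst must have room for n code points.  Returns the number of code
 * points written, or -1 on an unpaired surrogate.
 */
long utf16_to_ucs4(const uint16_t *src, size_t n, uint32_t *dst)
{
    size_t out = 0;
    size_t i = 0;
    while (i < n) {
        uint16_t u = src[i++];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* A high surrogate must be followed by a low surrogate. */
            if (i >= n || src[i] < 0xDC00 || src[i] > 0xDFFF) {
                return -1;
            }
            uint16_t lo = src[i++];
            dst[out++] = 0x10000 + (((uint32_t)(u - 0xD800) << 10)
                                    | (uint32_t)(lo - 0xDC00));
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return -1;  /* low surrogate with no preceding high surrogate */
        } else {
            dst[out++] = u;
        }
    }
    return (long)out;
}
```

For example, the three code points 0x41, 0xE9 and 0x1F600 encode to four
UTF-16 code units, the last two being a surrogate pair, and decode back to the
original three values.  An input containing 0xD800 on its own is rejected,
which is the case where a strict decoder would otherwise refuse the output.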
    I talked briefly to Stéfan about this and he mentioned that you were
involved in all of this and that things are in a state of flux.  So before I
devote a significant amount of time and thought to this, I thought I'd put
myself out in the open and see if there are any major holes in my
rationale, or if things will change significantly enough that I should
adjust my approach.
Thanks,
Dan