[Numpy-discussion] Status of NumPy and Python 3.3
Ondřej Čertík
ondrej.certik at gmail.com
Sat Jul 28 20:09:20 EDT 2012
On Sat, Jul 28, 2012 at 3:31 PM, Ondřej Čertík <ondrej.certik at gmail.com> wrote:
> On Sat, Jul 28, 2012 at 3:04 PM, Ondřej Čertík <ondrej.certik at gmail.com> wrote:
>> Many of the failures in
>> https://gist.github.com/3194707/5696c8d3091b16ba8a9f00a921d512ed02e94d71
>> are of the type:
>>
>> ======================================================================
>> FAIL: Check byteorder of single-dimensional objects
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>> File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py",
>> line 286, in test_valuesSD
>> self.assertTrue(ua[0] != ua2[0])
>> AssertionError: False is not true
>>
>>
>> and those are caused by the following minimal example:
>>
>> Python 3.2:
>>
>>>>> from numpy import array
>>>>> a = array(["abc"])
>>>>> b = a.newbyteorder()
>>>>> a.dtype
>> dtype('<U3')
>>>>> b.dtype
>> dtype('>U3')
>>>>> a[0].dtype
>> dtype('<U3')
>>>>> b[0].dtype
>> dtype('<U6')
>>>>> a[0] == b[0]
>> False
>>>>> a[0]
>> 'abc'
>>>>> b[0]
>> 'ៀ\udc00埀\udc00韀\udc00'
>>
>>
>> Python 3.3:
>>
>>
>>>>> from numpy import array
>>>>> a = array(["abc"])
>>>>> b = a.newbyteorder()
>>>>> a.dtype
>> dtype('<U3')
>>>>> b.dtype
>> dtype('>U3')
>>>>> a[0].dtype
>> dtype('<U3')
>>>>> b[0].dtype
>> dtype('<U3')
>>>>> a[0] == b[0]
>> True
>>>>> a[0]
>> 'abc'
>>>>> b[0]
>> 'abc'
>>
>>
>> So somehow the newbyteorder() method doesn't change the dtype of the
>> elements in our new code.
>> This method is implemented in numpy/core/src/multiarray/descriptor.c
>> (I think), but so far I don't see
>> where the problem could be.
>>
>> Any ideas?
>
> Ok, after some investigating, I think we need to do something along these lines:
>
> diff --git a/numpy/core/src/multiarray/scalarapi.c b/numpy/core/src/multiarray/s
> index c134aed..daf7fc4 100644
> --- a/numpy/core/src/multiarray/scalarapi.c
> +++ b/numpy/core/src/multiarray/scalarapi.c
> @@ -644,7 +644,20 @@ PyArray_Scalar(void *data, PyArray_Descr *descr, PyObject *
> #if PY_VERSION_HEX >= 0x03030000
> if (type_num == NPY_UNICODE) {
> PyObject *b, *args;
> - b = PyBytes_FromStringAndSize(data, itemsize);
> + if (swap) {
> + char *buffer;
> + buffer = malloc(itemsize);
> + if (buffer == NULL) {
> + PyErr_NoMemory();
> + }
> + memcpy(buffer, data, itemsize);
> + byte_swap_vector(buffer, itemsize, 4);
> + b = PyBytes_FromStringAndSize(buffer, itemsize);
> + // We have to deallocate this later, otherwise we get a segfault...
> + //free(buffer);
> + } else {
> + b = PyBytes_FromStringAndSize(data, itemsize);
> + }
> if (b == NULL) {
> return NULL;
> }
>
> This particular implementation still fails though:
>
>
>>>> from numpy import array
>>>> a = array(["abc"])
>>>> b = a.newbyteorder()
>>>> a.dtype
> dtype('<U3')
>>>> b.dtype
> dtype('>U3')
>>>> a[0].dtype
> dtype('<U3')
>>>> b[0].dtype
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
> codepoint not in range(0x110000)
>>>> a[0] == b[0]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
> codepoint not in range(0x110000)
>>>> a[0]
> 'abc'
>>>> b[0]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
> codepoint not in range(0x110000)
>
>
>
> But I think that we simply need to take into account the "swap" flag.
Ok, so first of all, I tried to disable the swapping in Python 3.2:
if (swap) {
byte_swap_vector(buffer, itemsize >> 2, 4);
}
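(For reference: byte_swap_vector takes the number of *elements*, not bytes, which is why the count has to be itemsize >> 2 for 4-byte UCS4 units -- that was one of the bugs in my first patch. Here is a rough pure-Python model of what it does, just a sketch assuming the (buffer, n, size) signature used above:)

```python
def byte_swap_vector(buf, n, size):
    """Model of byte_swap_vector: reverse the bytes of each of the
    n elements of `size` bytes in buf (a bytearray), in place."""
    for i in range(n):
        buf[i * size:(i + 1) * size] = buf[i * size:(i + 1) * size][::-1]
    return buf

data = bytearray("abc".encode("utf_32_le"))  # 12 bytes, itemsize = 12
itemsize = len(data)
# Correct call: swap itemsize >> 2 (= 3) four-byte code units.
# (In C, passing itemsize as the count would overrun the buffer.)
swapped = byte_swap_vector(bytearray(data), itemsize >> 2, 4)
assert bytes(swapped) == "abc".encode("utf_32_be")
```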
With that change it behaves *exactly* as in Python 3.3. So I am pretty
sure that the problem is right there and something along the lines of
my patch above should fix it. I had a few bugs in it; here is the
corrected version:
diff --git a/numpy/core/src/multiarray/scalarapi.c b/numpy/core/src/multiarray/s
index c134aed..bed73f7 100644
--- a/numpy/core/src/multiarray/scalarapi.c
+++ b/numpy/core/src/multiarray/scalarapi.c
@@ -644,7 +644,19 @@ PyArray_Scalar(void *data, PyArray_Descr *descr, PyObject *
#if PY_VERSION_HEX >= 0x03030000
if (type_num == NPY_UNICODE) {
PyObject *b, *args;
- b = PyBytes_FromStringAndSize(data, itemsize);
+ if (swap) {
+ char *buffer;
+ buffer = malloc(itemsize);
+ if (buffer == NULL) {
+ PyErr_NoMemory();
+ }
+ memcpy(buffer, data, itemsize);
+ byte_swap_vector(buffer, itemsize >> 2, 4);
+ b = PyBytes_FromStringAndSize(buffer, itemsize);
+ free(buffer);
+ } else {
+ b = PyBytes_FromStringAndSize(data, itemsize);
+ }
if (b == NULL) {
return NULL;
}
That works well, except that it now raises a UnicodeDecodeError:
>>> b[0].dtype
NULL
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
codepoint not in range(0x110000)
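(The decode failure itself is easy to reproduce in pure Python: byte-swapping the UTF-32 representation of "abc" turns 'a' (0x61) into the code point 0x61000000, which is far above the U+10FFFF limit, so the utf32 codec rejects it. A minimal sketch:)

```python
# 'abc' stored little-endian as UTF-32 (what the array holds natively).
native = "abc".encode("utf_32_le")
# Byte-swap each 4-byte code unit, as the patched code does for swapped scalars.
swapped = b"".join(native[i:i + 4][::-1] for i in range(0, len(native), 4))
# 'a' (0x61) becomes 0x61000000 > 0x10FFFF, so decoding must fail:
try:
    swapped.decode("utf_32_le")
except UnicodeDecodeError as e:
    print(e)  # ... can't decode bytes in position 0-3
```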
This error is actually triggered by this line:

obj = type->tp_new(type, args, NULL);

in the patch by Stefan above. So I think what is happening is that it
simply tries to convert the bytes to a string and fails, which makes
sense. The question is why it doesn't fail in exactly the same way in
Python 3.2; I think it's because the codepoint check is bypassed there
somehow. Stefan, I think we need to swap it after the object is
created. I am still experimenting with this.
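(Whatever the final fix ends up looking like, the invariant we need is: the raw bytes must be interpreted in the byte order that the dtype declares, i.e. brought to a known order *before* the unicode object is built. A pure-Python sketch of that idea -- not the C implementation, and scalar_from_utf32 is just a name I made up for illustration:)

```python
def scalar_from_utf32(data, byteorder):
    """Sketch: build the scalar string value from raw UTF-32 storage.
    `byteorder` is '<' or '>' as in the dtype; choosing the matching
    codec is equivalent to swapping to native order and then decoding."""
    codec = "utf_32_le" if byteorder == "<" else "utf_32_be"
    return data.decode(codec)

raw = "abc".encode("utf_32_be")           # storage declared big-endian
assert scalar_from_utf32(raw, ">") == "abc"
```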
Ondrej