[Numpy-discussion] Status of NumPy and Python 3.3
Ondřej Čertík
ondrej.certik at gmail.com
Sat Jul 28 20:09:20 EDT 2012
On Sat, Jul 28, 2012 at 3:31 PM, Ondřej Čertík <ondrej.certik at gmail.com> wrote:
> On Sat, Jul 28, 2012 at 3:04 PM, Ondřej Čertík <ondrej.certik at gmail.com> wrote:
>> Many of the failures in
>> https://gist.github.com/3194707/5696c8d3091b16ba8a9f00a921d512ed02e94d71
>> are of the type:
>>
>> ======================================================================
>> FAIL: Check byteorder of single-dimensional objects
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>> File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py",
>> line 286, in test_valuesSD
>> self.assertTrue(ua[0] != ua2[0])
>> AssertionError: False is not true
>>
>>
>> and those are caused by the following minimal example:
>>
>> Python 3.2:
>>
>>>>> from numpy import array
>>>>> a = array(["abc"])
>>>>> b = a.newbyteorder()
>>>>> a.dtype
>> dtype('<U3')
>>>>> b.dtype
>> dtype('>U3')
>>>>> a[0].dtype
>> dtype('<U3')
>>>>> b[0].dtype
>> dtype('<U6')
>>>>> a[0] == b[0]
>> False
>>>>> a[0]
>> 'abc'
>>>>> b[0]
>> 'ៀ\udc00埀\udc00韀\udc00'
>>
>>
>> Python 3.3:
>>
>>
>>>>> from numpy import array
>>>>> a = array(["abc"])
>>>>> b = a.newbyteorder()
>>>>> a.dtype
>> dtype('<U3')
>>>>> b.dtype
>> dtype('>U3')
>>>>> a[0].dtype
>> dtype('<U3')
>>>>> b[0].dtype
>> dtype('<U3')
>>>>> a[0] == b[0]
>> True
>>>>> a[0]
>> 'abc'
>>>>> b[0]
>> 'abc'
>>
>>
>> So somehow the newbyteorder() method doesn't change the dtype of the
>> elements in our new code.
>> This method is implemented in numpy/core/src/multiarray/descriptor.c
>> (I think), but so far I don't see
>> where the problem could be.
>>
>> Any ideas?
>
> Ok, after some investigating, I think we need to do something along these lines:
>
> diff --git a/numpy/core/src/multiarray/scalarapi.c b/numpy/core/src/multiarray/s
> index c134aed..daf7fc4 100644
> --- a/numpy/core/src/multiarray/scalarapi.c
> +++ b/numpy/core/src/multiarray/scalarapi.c
> @@ -644,7 +644,20 @@ PyArray_Scalar(void *data, PyArray_Descr *descr, PyObject *
> #if PY_VERSION_HEX >= 0x03030000
> if (type_num == NPY_UNICODE) {
> PyObject *b, *args;
> - b = PyBytes_FromStringAndSize(data, itemsize);
> + if (swap) {
> + char *buffer;
> + buffer = malloc(itemsize);
> + if (buffer == NULL) {
> + PyErr_NoMemory();
> + }
> + memcpy(buffer, data, itemsize);
> + byte_swap_vector(buffer, itemsize, 4);
> + b = PyBytes_FromStringAndSize(buffer, itemsize);
> + // We have to deallocate this later, otherwise we get a segfault...
> + //free(buffer);
> + } else {
> + b = PyBytes_FromStringAndSize(data, itemsize);
> + }
> if (b == NULL) {
> return NULL;
> }
>
> This particular implementation still fails though:
>
>
>>>> from numpy import array
>>>> a = array(["abc"])
>>>> b = a.newbyteorder()
>>>> a.dtype
> dtype('<U3')
>>>> b.dtype
> dtype('>U3')
>>>> a[0].dtype
> dtype('<U3')
>>>> b[0].dtype
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
> codepoint not in range(0x110000)
>>>> a[0] == b[0]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
> codepoint not in range(0x110000)
>>>> a[0]
> 'abc'
>>>> b[0]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
> codepoint not in range(0x110000)
>
>
>
> But I think that we simply need to take into account the "swap" flag.
Ok, so first of all, I tried to disable the swapping in Python 3.2:
if (swap) {
byte_swap_vector(buffer, itemsize >> 2, 4);
}
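(For reference: byte_swap_vector takes the number of *elements*, not bytes, which is why the count has to be itemsize >> 2 for 4-byte UCS4 units -- that was one of the bugs in my first patch. Here is a rough pure-Python model of what it does, just a sketch assuming the (buffer, n, size) signature used above:)

```python
def byte_swap_vector(buf, n, size):
    """Model of byte_swap_vector: reverse the bytes of each of the
    n elements of `size` bytes in buf (a bytearray), in place."""
    for i in range(n):
        buf[i * size:(i + 1) * size] = buf[i * size:(i + 1) * size][::-1]
    return buf

data = bytearray("abc".encode("utf_32_le"))  # 12 bytes, itemsize = 12
itemsize = len(data)
# Correct call: swap itemsize >> 2 (= 3) four-byte code units.
# (In C, passing itemsize as the count would overrun the buffer.)
swapped = byte_swap_vector(bytearray(data), itemsize >> 2, 4)
assert bytes(swapped) == "abc".encode("utf_32_be")
```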
With that change it behaves *exactly* as in Python 3.3. So I am pretty
sure that the problem is right there and something along the lines of
my patch above should fix it. I had a few bugs in it; here is the
corrected version:
diff --git a/numpy/core/src/multiarray/scalarapi.c b/numpy/core/src/multiarray/s
index c134aed..bed73f7 100644
--- a/numpy/core/src/multiarray/scalarapi.c
+++ b/numpy/core/src/multiarray/scalarapi.c
@@ -644,7 +644,19 @@ PyArray_Scalar(void *data, PyArray_Descr *descr, PyObject *
#if PY_VERSION_HEX >= 0x03030000
if (type_num == NPY_UNICODE) {
PyObject *b, *args;
- b = PyBytes_FromStringAndSize(data, itemsize);
+ if (swap) {
+ char *buffer;
+ buffer = malloc(itemsize);
+ if (buffer == NULL) {
+ PyErr_NoMemory();
+ }
+ memcpy(buffer, data, itemsize);
+ byte_swap_vector(buffer, itemsize >> 2, 4);
+ b = PyBytes_FromStringAndSize(buffer, itemsize);
+ free(buffer);
+ } else {
+ b = PyBytes_FromStringAndSize(data, itemsize);
+ }
if (b == NULL) {
return NULL;
}
That works well, except that it now raises a UnicodeDecodeError:
>>> b[0].dtype
NULL
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
codepoint not in range(0x110000)
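(The decode failure itself is easy to reproduce in pure Python: byte-swapping the UTF-32 representation of "abc" turns 'a' (0x61) into the code point 0x61000000, which is far above the U+10FFFF limit, so the utf32 codec rejects it. A minimal sketch:)

```python
# 'abc' stored little-endian as UTF-32 (what the array holds natively).
native = "abc".encode("utf_32_le")
# Byte-swap each 4-byte code unit, as the patched code does for swapped scalars.
swapped = b"".join(native[i:i + 4][::-1] for i in range(0, len(native), 4))
# 'a' (0x61) becomes 0x61000000 > 0x10FFFF, so decoding must fail:
try:
    swapped.decode("utf_32_le")
except UnicodeDecodeError as e:
    print(e)  # ... can't decode bytes in position 0-3
```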
This error is actually triggered by this line:

obj = type->tp_new(type, args, NULL);

in the patch by Stefan above. So I think what is happening is that it
simply tries to convert the bytes to a string and fails, which makes
sense. The question is why it doesn't fail in exactly the same way in
Python 3.2; I think it's because the codepoint check is bypassed there
somehow. Stefan, I think we need to swap it after the object is
created. I am still experimenting with this.
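(Whatever the final fix ends up looking like, the invariant we need is: the raw bytes must be interpreted in the byte order that the dtype declares, i.e. brought to a known order *before* the unicode object is built. A pure-Python sketch of that idea -- not the C implementation, and scalar_from_utf32 is just a name I made up for illustration:)

```python
def scalar_from_utf32(data, byteorder):
    """Sketch: build the scalar string value from raw UTF-32 storage.
    `byteorder` is '<' or '>' as in the dtype; choosing the matching
    codec is equivalent to swapping to native order and then decoding."""
    codec = "utf_32_le" if byteorder == "<" else "utf_32_be"
    return data.decode(codec)

raw = "abc".encode("utf_32_be")           # storage declared big-endian
assert scalar_from_utf32(raw, ">") == "abc"
```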
Ondrej