[Numpy-discussion] Status of NumPy and Python 3.3

Sat Jul 28 21:09:23 EDT 2012

On Sat, Jul 28, 2012 at 5:09 PM, Ondřej Čertík <ondrej.certik at gmail.com> wrote:
> On Sat, Jul 28, 2012 at 3:31 PM, Ondřej Čertík <ondrej.certik at gmail.com> wrote:
>> On Sat, Jul 28, 2012 at 3:04 PM, Ondřej Čertík <ondrej.certik at gmail.com> wrote:
>>> Many of the failures in
>>> https://gist.github.com/3194707/5696c8d3091b16ba8a9f00a921d512ed02e94d71
>>> are of the type:
>>>
>>> ======================================================================
>>> FAIL: Check byteorder of single-dimensional objects
>>> ----------------------------------------------------------------------
>>> Traceback (most recent call last):
>>>   File "/home/ondrej/py33/lib/python3.3/site-packages/numpy/core/tests/test_unicode.py",
>>> line 286, in test_valuesSD
>>>     self.assertTrue(ua[0] != ua2[0])
>>> AssertionError: False is not true
>>>
>>>
>>> and those are caused by the following minimal example:
>>>
>>> Python 3.2:
>>>
>>>>>> from numpy import array
>>>>>> a = array(["abc"])
>>>>>> b = a.newbyteorder()
>>>>>> a.dtype
>>> dtype('<U3')
>>>>>> b.dtype
>>> dtype('>U3')
>>>>>> a[0].dtype
>>> dtype('<U3')
>>>>>> b[0].dtype
>>> dtype('<U6')
>>>>>> a[0] == b[0]
>>> False
>>>>>> a[0]
>>> 'abc'
>>>>>> b[0]
>>> 'ៀ\udc00埀\udc00韀\udc00'
>>>
>>>
>>> Python 3.3:
>>>
>>>
>>>>>> from numpy import array
>>>>>> a = array(["abc"])
>>>>>> b = a.newbyteorder()
>>>>>> a.dtype
>>> dtype('<U3')
>>>>>> b.dtype
>>> dtype('>U3')
>>>>>> a[0].dtype
>>> dtype('<U3')
>>>>>> b[0].dtype
>>> dtype('<U3')
>>>>>> a[0] == b[0]
>>> True
>>>>>> a[0]
>>> 'abc'
>>>>>> b[0]
>>> 'abc'
>>>
>>>
>>> So somehow, in our new code, the newbyteorder() method doesn't change
>>> the dtype of the elements. This method is implemented in
>>> numpy/core/src/multiarray/descriptor.c (I think), but so far I don't
>>> see where the problem could be.
>>>
>>> Any ideas?
>>
>> Ok, after some investigating, I think we need to do something along these lines:
>>
>> diff --git a/numpy/core/src/multiarray/scalarapi.c b/numpy/core/src/multiarray/s
>> index c134aed..daf7fc4 100644
>> --- a/numpy/core/src/multiarray/scalarapi.c
>> +++ b/numpy/core/src/multiarray/scalarapi.c
>> @@ -644,7 +644,20 @@ PyArray_Scalar(void *data, PyArray_Descr *descr, PyObject *
>>  #if PY_VERSION_HEX >= 0x03030000
>>      if (type_num == NPY_UNICODE) {
>>          PyObject *b, *args;
>> -        b = PyBytes_FromStringAndSize(data, itemsize);
>> +        if (swap) {
>> +            char *buffer;
>> +            buffer = malloc(itemsize);
>> +            if (buffer == NULL) {
>> +                PyErr_NoMemory();
>> +            }
>> +            memcpy(buffer, data, itemsize);
>> +            byte_swap_vector(buffer, itemsize, 4);
>> +            b = PyBytes_FromStringAndSize(buffer, itemsize);
>> +            // We have to deallocate this later, otherwise we get a segfault...
>> +            //free(buffer);
>> +        } else {
>> +            b = PyBytes_FromStringAndSize(data, itemsize);
>> +        }
>>          if (b == NULL) {
>>              return NULL;
>>          }
>>
>> This particular implementation still fails though:
>>
>>
>>>>> from numpy import array
>>>>> a = array(["abc"])
>>>>> b = a.newbyteorder()
>>>>> a.dtype
>> dtype('<U3')
>>>>> b.dtype
>> dtype('>U3')
>>>>> a[0].dtype
>> dtype('<U3')
>>>>> b[0].dtype
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
>> codepoint not in range(0x110000)
>>>>> a[0] == b[0]
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
>> codepoint not in range(0x110000)
>>>>> a[0]
>> 'abc'
>>>>> b[0]
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
>> codepoint not in range(0x110000)
>>
>>
>>
>> But I think that we simply need to take into account the "swap" flag.
>
> Ok, so first of all, I tried to disable the swapping in Python 3.2:
>
>                 if (swap) {
>                     byte_swap_vector(buffer, itemsize >> 2, 4);
>                 }
>
> And then it behaves *exactly* as in Python 3.3. So I am pretty sure
> that the missing swap is the problem and that something along the
> lines of my patch above should fix it. That patch had a few bugs;
> here is the corrected version:
>
> diff --git a/numpy/core/src/multiarray/scalarapi.c b/numpy/core/src/multiarray/s
> index c134aed..bed73f7 100644
> --- a/numpy/core/src/multiarray/scalarapi.c
> +++ b/numpy/core/src/multiarray/scalarapi.c
> @@ -644,7 +644,19 @@ PyArray_Scalar(void *data, PyArray_Descr *descr, PyObject *
>  #if PY_VERSION_HEX >= 0x03030000
>      if (type_num == NPY_UNICODE) {
>          PyObject *b, *args;
> -        b = PyBytes_FromStringAndSize(data, itemsize);
> +        if (swap) {
> +            char *buffer;
> +            buffer = malloc(itemsize);
> +            if (buffer == NULL) {
> +                return PyErr_NoMemory();
> +            }
> +            memcpy(buffer, data, itemsize);
> +            byte_swap_vector(buffer, itemsize >> 2, 4);
> +            b = PyBytes_FromStringAndSize(buffer, itemsize);
> +            free(buffer);
> +        } else {
> +            b = PyBytes_FromStringAndSize(data, itemsize);
> +        }
>          if (b == NULL) {
>              return NULL;
>          }
>
>
> That works, except that it now raises a UnicodeDecodeError:
>
>>>> b[0].dtype
> NULL
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
> codepoint not in range(0x110000)
>
> This error is actually triggered by this line:
>
>
>         obj = type->tp_new(type, args, NULL);
>
> in the patch by Stefan above. So I think what is happening is that
> tp_new simply tries to decode the bytes into a string and fails,
> which makes sense. The question is why it doesn't fail in exactly
> the same way in Python 3.2. I think it's because the conversion
> check is bypassed somehow. Stefan, I think we need to do the swap
> after the object is created. I am still experimenting with this.
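
To make the quoted UnicodeDecodeError concrete, here is a small pure-Python sketch that does not use NumPy at all. It assumes (as the sessions above suggest, but as an assumption of this sketch) that newbyteorder() leaves the underlying bytes alone and only relabels the dtype, so the same buffer is then read in the opposite byte order. The variable names are made up for the illustration.

```python
import struct

# Bytes of the little-endian UCS4 item "abc" (what a '<U3' element holds).
raw = "abc".encode("utf-32-le")

# Assumption: newbyteorder() keeps these bytes and only relabels the dtype
# as '>U3', so the same buffer is now read as big-endian 32-bit code units.
as_little = struct.unpack("<3I", raw)   # original reading: 0x61, 0x62, 0x63
as_big = struct.unpack(">3I", raw)      # relabeled reading: 0x61000000, ...

# Every relabeled value lies above the largest code point, U+10FFFF.
all_out_of_range = all(cp > 0x10FFFF for cp in as_big)

# Decoding the relabeled buffer as UTF-32 therefore fails.
try:
    raw.decode("utf-32-be")
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

The values in `as_big` are all outside the valid code point range, which is consistent with the "codepoint not in range(0x110000)" message in the traceback.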

Well, I went through the Python sources and implemented a working
solution in this patch:

https://github.com/certik/numpy/commit/36fcd1327746a3d0ad346ce58ffbe00506e27654
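
The commit itself is not reproduced here. As a rough illustration only, one way to take the "swap" flag into account is to choose the decoding byte order from it instead of swapping the buffer by hand. The sketch below is pure Python, `decode_ucs4` is a hypothetical helper, and it is not claimed to match what the commit does:

```python
import sys


def decode_ucs4(data: bytes, swap: bool) -> str:
    """Decode a UCS4 buffer, honoring a swap flag (hypothetical helper)."""
    # Start from the machine's native byte order ...
    little = sys.byteorder == "little"
    # ... and flip it when the data is stored in the opposite order.
    if swap:
        little = not little
    codec = "utf-32-le" if little else "utf-32-be"
    return data.decode(codec)


native_codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
other_codec = "utf-32-be" if sys.byteorder == "little" else "utf-32-le"

same_order = decode_ucs4("abc".encode(native_codec), swap=False)
opposite_order = decode_ucs4("abc".encode(other_codec), swap=True)
```

Both calls should give back "abc", while passing native-order bytes with swap=True raises a UnicodeDecodeError like the one quoted above.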

So now the PR actually seems to work. The remaining failures are here:

https://gist.github.com/3195520

and they seem to be unrelated. Can somebody please review this PR?

https://github.com/numpy/numpy/pull/366

I will squash the commits after it's reviewed (I want to keep the
history there for now).

Ondrej