weird searchsorted behavior for unicode array
I am seeing some very strange behavior searching a unicode array. The attached code outputs the following:

UNICODE
Is sorted: True
Search sorted by iteration, left: [0, 1, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 13]
Search sorted by iteration, right: [0, 2, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 13]
Search sorted by indexing, left: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13]
Search sorted by indexing, right: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13]
Search sorted by indexing with copy, left: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13]
Search sorted by indexing with copy, right: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13]

If I remove the first print, it produces:

Is sorted: True
Search sorted by iteration, left: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Search sorted by iteration, right: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Search sorted by indexing, left: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13]
Search sorted by indexing, right: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13]
Search sorted by indexing with copy, left: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Search sorted by indexing with copy, right: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

Neither answer is correct, since left and right should be offset by 1 when searching for an element in the array, by my reading of the docs.

This is numpy 1.6.1 on OSX 10.6, Python 2.7.

Am I missing something?

Thanks,
Ray Jones
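For reference, the expected left/right offset can be sketched as follows. The attached script is not reproduced in this thread, so the array below is hypothetical stand-in data (14 distinct, sorted unicode strings), and the output shown is what a NumPy build without the bug produces.

```python
import numpy as np

# Hypothetical stand-in for the attachment: 14 distinct, sorted unicode strings.
a = np.array([u"item%02d" % i for i in range(14)])
assert (np.sort(a) == a).all()

# Search for every element of the array in the array itself.
left = np.searchsorted(a, a, side="left")
right = np.searchsorted(a, a, side="right")

# side="left" returns the index of the match; side="right" returns one past it.
print(left.tolist())   # [0, 1, 2, ..., 13]
print(right.tolist())  # [1, 2, 3, ..., 14]
```

For distinct elements, `right - left` is 1 everywhere, which is the offset the report says is missing.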
On Wed, Mar 28, 2012 at 10:55 AM, Thouis (Ray) Jones <thouis@gmail.com> wrote:
Neither answer is correct, since left and right should be offset by 1 when searching for an element in the array, by my reading of the docs.
This is numpy 1.6.1 on OSX 10.6, python 2.7
Am I missing something?
Adding this

# -*- coding: utf-8 -*-

produces consistent results for me. Maybe it is the regex for the encoding declaration, but I thought that has to be on the first line.

Josef
Thanks, Ray Jones
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Wed, Mar 28, 2012 at 11:51 AM, <josef.pktd@gmail.com> wrote:
Adding this

# -*- coding: utf-8 -*-

produces consistent results for me. Maybe it is the regex for the encoding declaration, but I thought that has to be on the first line.
By "consistent" I mean the results are the same with and without commenting out the UNICODE print, but searchsorted still doesn't distinguish between left and right. That looks like a bug (in numpy 1.4.1).

Using an object array, or a string view, a.view('<S524'), produces the correct left/right shift.

Josef
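The two workarounds mentioned above can be sketched roughly like this. The data is hypothetical (ASCII-only strings, so that comparing raw bytes gives the same order as comparing code points), and the view width is computed from the array's own item size rather than hard-coded as in `'<S524'`.

```python
import numpy as np

# Hypothetical data, ASCII only, so byte order matches code-point order.
a = np.array([u"item%02d" % i for i in range(14)])

# Workaround 1: object array, compared with Python's own string comparison.
obj = a.astype(object)
obj_left = np.searchsorted(obj, obj, side="left")
obj_right = np.searchsorted(obj, obj, side="right")

# Workaround 2: reinterpret each element's raw bytes as a fixed-width byte string.
raw = a.view("S%d" % a.dtype.itemsize)
raw_left = np.searchsorted(raw, raw, side="left")
raw_right = np.searchsorted(raw, raw, side="right")

print(obj_left.tolist(), obj_right.tolist())
print(raw_left.tolist(), raw_right.tolist())
```

Both paths avoid the unicode comparison routine. The byte-string view is only safe under the stated assumption about the data; for arbitrary code points the byte order need not match the string order.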
Thanks, Ray Jones
It seems to be a bug in the unicode string length computation in arraytypes.c.src:UNICODE_compare(), based on comparison to the code in arrayobject.c:_myunicmp() and arrayobject.c:_compare_strings().

Patch below (against maintenance/1.6.x, but the bug also looks to be present in master based on my reading of the code).

---
 numpy/core/src/multiarray/arraytypes.c.src |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/numpy/core/src/multiarray/arraytypes.c.src b/numpy/core/src/multiarray/arraytypes.c.src
index fde95c4..660d1e5 100644
--- a/numpy/core/src/multiarray/arraytypes.c.src
+++ b/numpy/core/src/multiarray/arraytypes.c.src
@@ -2789,7 +2789,7 @@ static int
 UNICODE_compare(PyArray_UCS4 *ip1, PyArray_UCS4 *ip2,
                 PyArrayObject *ap)
 {
-    int itemsize = ap->descr->elsize;
+    int itemsize = (ap->descr->elsize) >> 2;
 
     if (itemsize < 0) {
         return 0;
-- 
1.7.9.3
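To illustrate why the length matters, here is a small Python simulation of the comparison; it is not NumPy's C code. `elsize` is the element size in bytes, while the comparison walks 4-byte UCS4 code units, so the loop bound should be `elsize >> 2`. The `compare` helper and the sample array are hypothetical, and Python slicing stops at the end of the buffer where the C code would keep reading.

```python
import numpy as np

# Elements 0 and 2 are equal; their neighbours in memory differ.
a = np.array([u"ab", u"zz", u"ab", u"aa"])  # dtype '<U2'

elsize = a.dtype.itemsize   # element size in bytes: 8
nchars = elsize >> 2        # element size in UCS4 code units: 2

# Flat view of the underlying buffer as 32-bit code points.
flat = a.view(np.uint32)

def compare(i, j, length):
    """Three-way compare of elements i and j over `length` code units (hypothetical helper)."""
    x = flat[i * nchars : i * nchars + length].tolist()
    y = flat[j * nchars : j * nchars + length].tolist()
    return (x > y) - (x < y)

print(compare(0, 2, nchars))  # length in code units: the elements compare equal
print(compare(0, 2, elsize))  # length in bytes: the scan runs on into neighbouring elements
```

With the byte count as the length, the result depends on whatever follows each element in memory, which would be consistent with the output changing when an unrelated print is removed.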
On Thu, Mar 29, 2012 at 11:04, Thouis (Ray) Jones <thouis@gmail.com> wrote:
It seems to be a bug in the unicode string length computation in arraytypes.c.src:UNICODE_compare(), based on comparison to the code in arrayobject.c:_myunicmp() and arrayobject.c:_compare_strings().
Patch below (against maintenance/1.6.x, but the bug also looks to be present in master based on my reading of the code).
I just submitted a PR against numpy master for this bug, adding a test based on the example I posted.

https://github.com/numpy/numpy/pull/243

Ray Jones
participants (2)
- josef.pktd@gmail.com
- Thouis (Ray) Jones