Re: [Numpy-discussion] #2522 numpy.diff fails on unsigned integers
On 2014-11-04 19:44, Charles R Harris wrote:
On Tue, Nov 4, 2014 at 11:19 AM, Sebastian
wrote: On 2014-11-04 15:06, Todd wrote:
On Tue, Nov 4, 2014 at 2:50 PM, Sebastian Wagner
mailto:sebix@sebix.at> wrote:
Hello,
I want to bring up Issue #2522 'numpy.diff fails on unsigned integers (Trac #1929)' [1], as it was resonsible for an error in one of our programs. Short explanation of the bug: np.diff performs a subtraction on the input array. If this is of type uint and the data contains falling data, it results in an artihmetic underflow.
np.diff(np.array([0,1,0], dtype=np.uint8)) array([ 1, 255], dtype=uint8)
@charris proposed either - a note to the doc string and maybe an example to clarify things - or raise a warning but with a discussion on the list.
I would like to start it now, as it is an error which is not easily detectable (no errors or warnings are thrown). In our case the type of a data sequence, with only zeros and ones, had type f8 as also every other one, has been changed to u4. As the programs looked for values ==1 and ==-1, it broke silently. In my opinion, a note in the docs is not enough and does not help if the type changed or set after the program has been written. I'd go for automatic upcasting of uints by default and an option to turn it off, if this behavior is explicitly wanted. This wouldn't be correct from the point of view of a programmer, but as most of the users have a scientific background who excpect it 'to work', instead of sth is theoretically correct but not convenient. (I count myself to the first group)
When you say "automatic upcasting", that would be, for example uint8 to int16? What about for uint64? There is no int128. The upcast should go to the next bigger, otherwise it would again result in wrong values. uint64 we can't do that, so it has to stay. Also, when you say "by default", is this only when an overflow is detected, or always? I don't know how I could detect an overflow in the diff-function. In subtraction it should be possible, but that's very deep in the numpy-internals. How would the option to turn it off be implemented? An argument to np.diff or some sort of global option? I thought of a parameter upcast_int=True for the function.
Could check for non-decreasing sequence in the unsigned case. Note that differences of signed integers can also overflow. One way to check in general is to determine the expected sign using comparisons.
I think you mean a decreasing/non-increasing instead of non-decreasing sequence? It's also the same check as checking for a sorted sequence. But I currently don't know how I could do that efficiently without np.diff in Python, in Cython it should be easily possible. np.gradient has the same problem:
np.random.seed(89) d = np.random.randint(0,2,size=10).astype(np.uint8); d array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0], dtype=uint8) np.diff(d) array([255, 0, 1, 255, 1, 0, 255, 0, 0], dtype=uint8) np.gradient(d) array([ 255. , 127.5, 0.5, 0. , 0. , 0.5, 127.5, 127.5, 0. , 0. ])
--- gpg --keyserver keys.gnupg.net --recv-key DC9B463B
On Thu, Nov 13, 2014 at 8:10 AM, Sebastian
On 2014-11-04 19:44, Charles R Harris wrote:
On Tue, Nov 4, 2014 at 11:19 AM, Sebastian
wrote: On 2014-11-04 15:06, Todd wrote:
On Tue, Nov 4, 2014 at 2:50 PM, Sebastian Wagner
mailto:sebix@sebix.at> wrote:
Hello,
I want to bring up Issue #2522 'numpy.diff fails on unsigned integers (Trac #1929)' [1], as it was resonsible for an error in one of our programs. Short explanation of the bug: np.diff performs a subtraction on the input array. If this is of type uint and the data contains falling data, it results in an artihmetic underflow.
> np.diff(np.array([0,1,0], dtype=np.uint8)) array([ 1, 255], dtype=uint8)
@charris proposed either - a note to the doc string and maybe an example to clarify things - or raise a warning but with a discussion on the list.
I would like to start it now, as it is an error which is not easily detectable (no errors or warnings are thrown). In our case the type of a data sequence, with only zeros and ones, had type f8 as also every other one, has been changed to u4. As the programs looked for values ==1 and ==-1, it broke silently. In my opinion, a note in the docs is not enough and does not help if the type changed or set after the program has been written. I'd go for automatic upcasting of uints by default and an option to turn it off, if this behavior is explicitly wanted. This wouldn't be correct from the point of view of a programmer, but as most of the users have a scientific background who excpect it 'to work', instead of sth is theoretically correct but not convenient. (I count myself to the first group)
When you say "automatic upcasting", that would be, for example uint8 to int16? What about for uint64? There is no int128. The upcast should go to the next bigger, otherwise it would again result in wrong values. uint64 we can't do that, so it has to stay. Also, when you say "by default", is this only when an overflow is detected, or always? I don't know how I could detect an overflow in the diff-function. In subtraction it should be possible, but that's very deep in the numpy-internals. How would the option to turn it off be implemented? An argument to np.diff or some sort of global option? I thought of a parameter upcast_int=True for the function.
Could check for non-decreasing sequence in the unsigned case. Note that differences of signed integers can also overflow. One way to check in general is to determine the expected sign using comparisons.
I think you mean a decreasing/non-increasing instead of non-decreasing sequence? It's also the same check as checking for a sorted sequence. But I currently don't know how I could do that efficiently without np.diff in Python, in Cython it should be easily possible.
np.gradient has the same problem:
np.random.seed(89) d = np.random.randint(0,2,size=10).astype(np.uint8); d array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0], dtype=uint8) np.diff(d) array([255, 0, 1, 255, 1, 0, 255, 0, 0], dtype=uint8) np.gradient(d) array([ 255. , 127.5, 0.5, 0. , 0. , 0.5, 127.5, 127.5, 0. , 0. ])
Consider it is generally an error, might it be good to have a general warning built into the int dtypes regarding overflow errors? That warning can then be caught by the diff function.
On Wed, Nov 12, 2014 at 11:10 PM, Sebastian
On 2014-11-04 19:44, Charles R Harris wrote:
On Tue, Nov 4, 2014 at 11:19 AM, Sebastian
wrote: On 2014-11-04 15:06, Todd wrote:
On Tue, Nov 4, 2014 at 2:50 PM, Sebastian Wagner
mailto:sebix@sebix.at> wrote:
Hello,
I want to bring up Issue #2522 'numpy.diff fails on unsigned integers (Trac #1929)' [1], as it was resonsible for an error in one of our programs. Short explanation of the bug: np.diff performs a subtraction on the input array. If this is of type uint and the data contains falling data, it results in an artihmetic underflow.
> np.diff(np.array([0,1,0], dtype=np.uint8)) array([ 1, 255], dtype=uint8)
@charris proposed either - a note to the doc string and maybe an example to clarify things - or raise a warning but with a discussion on the list.
I would like to start it now, as it is an error which is not easily detectable (no errors or warnings are thrown). In our case the type of a data sequence, with only zeros and ones, had type f8 as also every other one, has been changed to u4. As the programs looked for values ==1 and ==-1, it broke silently. In my opinion, a note in the docs is not enough and does not help if the type changed or set after the program has been written. I'd go for automatic upcasting of uints by default and an option to turn it off, if this behavior is explicitly wanted. This wouldn't be correct from the point of view of a programmer, but as most of the users have a scientific background who excpect it 'to work', instead of sth is theoretically correct but not convenient. (I count myself to the first group)
When you say "automatic upcasting", that would be, for example uint8 to int16? What about for uint64? There is no int128. The upcast should go to the next bigger, otherwise it would again result in wrong values. uint64 we can't do that, so it has to stay. Also, when you say "by default", is this only when an overflow is detected, or always? I don't know how I could detect an overflow in the diff-function. In subtraction it should be possible, but that's very deep in the numpy-internals. How would the option to turn it off be implemented? An argument to np.diff or some sort of global option? I thought of a parameter upcast_int=True for the function.
Could check for non-decreasing sequence in the unsigned case. Note that differences of signed integers can also overflow. One way to check in general is to determine the expected sign using comparisons.
I think you mean a decreasing/non-increasing instead of non-decreasing sequence? It's also the same check as checking for a sorted sequence. But I currently don't know how I could do that efficiently without np.diff in Python, in Cython it should be easily possible.
For a 1D array, you could simply do: if a.dtype.kind == 'u' and np.any(a[1:] < a[:-1]): # warn, upcast, or whatever return a[1:] - a[:-1] The general case is a little more complicated because of the axis handling, but if you look at the source https://github.com/numpy/numpy/blob/v1.9.1/numpy/lib/function_base.py#L1055, you just need to replace 1: with slice1 and :-1 with slice2.
np.gradient has the same problem:
np.random.seed(89) d = np.random.randint(0,2,size=10).astype(np.uint8); d array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0], dtype=uint8) np.diff(d) array([255, 0, 1, 255, 1, 0, 255, 0, 0], dtype=uint8) np.gradient(d) array([ 255. , 127.5, 0.5, 0. , 0. , 0.5, 127.5, 127.5, 0. , 0. ])
The case of gradient is kind of sad, because it actually goes through the trouble of figuring out the right dtype for the output if yt isn't floating point already, but doesn't upcast the operations... Take a look at the source here, https://github.com/numpy/numpy/blob/v1.9.1/numpy/lib/function_base.py#L886, and wherever you find something like out[slice1] = (y[slice2] - y[slice3]) simply replace it with: np.subtract(y[slice2], y[slice3], dtype=dtype, out=out[slice1]) and it should work fine.
--- gpg --keyserver keys.gnupg.net --recv-key DC9B463B _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
-- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
participants (3)
-
Jaime Fernández del Río
-
Sebastian
-
Todd