[Numpy-discussion] Re: scipy.stats.itemfreq: overflow with add.reduce

Wed Dec 21 23:54:04 EST 2005

Tim Churches schrieb:
> Hans Georg Krauthaeuser wrote:
> 
>>Hans Georg Krauthaeuser schrieb:
>>
>>
>>>Hi All,
>>>
>>>I was playing with scipy.stats.itemfreq when I observed the following 
>>>overflow:
>>>
>>>In [119]:for i in [254,255,256,257,258]:
>>>  .....:    l=[0]*i
>>>  .....:    print i, stats.itemfreq(l), l.count(0)
>>>  .....:
>>>254 [ [  0 254]] 254
>>>255 [ [  0 255]] 255
>>>256 [ [0 0]] 256
>>>257 [ [0 1]] 257
>>>258 [ [0 2]] 258
>>>
>>>itemfreq is pretty small (in stats.py):
>>>
>>>----------------------------------------------------------------------
>>>def itemfreq(a):
>>>   """
>>>Returns a 2D array of item frequencies.  Column 1 contains item values,
>>>column 2 contains their respective counts.  Assumes a 1D array is passed.
>>>
>>>Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies)
>>>"""
>>>   scores = _support.unique(a)
>>>   scores = sort(scores)
>>>   freq = zeros(len(scores))
>>>   for i in range(len(scores)):
>>>       freq[i] = add.reduce(equal(a,scores[i]))
>>>   return array(_support.abut(scores, freq))
>>>----------------------------------------------------------------------
>>>
>>>It seems that add.reduce is the source for the overflow:
>>>
>>>In [116]:from scipy import *
>>>
>>>In [117]:for i in [254,255,256,257,258]:
>>>  .....:    l=[0]*i
>>>  .....:    print i, add.reduce(equal(l,0))
>>>  .....:
>>>254 254
>>>255 255
>>>256 0
>>>257 1
>>>258 2
>>>
>>>Is there any possibility to avoid the overflow?
> 
> 
> Apropos the preceding, herewith a thread on the Numpy list from a more
> than a few months ago. The take-home message is that for integer arrays,
> add.reduce is very fast at producing results which fall into two
> categories: a) correct or b) incorrect due to overflow. Unfortunately
> there is no equally quick method of determining into which of these two
> categories any specific result returned by add.reduce falls.
> 
> Tim C
> 
> From: Tim Churches <tchur at optushome.com.au>
> To: Todd Miller <jmiller at stsci.edu>
> Date: Apr 12 2005 - 7:51am
> 
> Todd Miller wrote:
> 
>>On Sun, 2005-04-10 at 10:23 +1000, Tim Churches wrote:
>>
>>
>>>I just got caught by code equivalent to this (with NumPy 23.8 on 32 bit
>>>Linux):
>>>
>>>
>>>>>>import Numeric as N
>>>>>>a = N.array((2000000000,1000000000),typecode=N.Int32)
>>>>>>N.add.reduce(a)
>>>
>>>-1294967296
>>>
>>>OK, it is an elementary mistake, but the silent overflow caught me
>>>unawares. casting the array to Float64 before summing it avoids the
>>>error, but in my instance the actual data is a rank-1 array of 21
>>>million integers with a mean value of about 140 (which adds up more than
>>>sys.maxint), and casting to Float64 will use quite a lot of memory (as
>>>well as taking some time).
>>>
>>>Any advice for catching or avoiding such overflow without always
>>>incurring a performance and memory hit by always casting to Float64?
>>
>>
>>Here's what numarray does:
>>
>>
>>
>>>>>import numarray as N
>>>>>a = N.array((2000000000,1000000000),typecode=N.Int32)
>>>>>N.add.reduce(a)
>>
>>-1294967296
>>
>>So basic reductions in numarray have the same "careful while you're
>>shaving" behavior as Numeric; it's fast but easy to screw up.
> 
> 
> Sure, but how does one be careful? It seems that for any array of two
> integers or more which could sum to more than sys.maxint or less than
> -sys.maxint, add.reduce() in both NumPy and Numeric will give either a)
> the correct answer or b) the incorrect answer, and short of adding up
> the array using a safer but much slower method, there is no way of
> determining if the answer provided (quickly) by add.reduce is right or
> wrong? Which seems to make it fast but useless (for integer arrays, at
> least? Is that an unfair summary? Can anyone point me towards a method
> for using add.reduce() on small arrays of large integers with values in
> the billions, or on large arrays of fairly small integer values, which
> will not suddenly and without warning give the wrong answer?
> 
> 
>>But:
>>
>>
>>
>>>>>a.sum()
>>
>>3000000000L
>>
>>
>>>>>a.sum(type='d')
>>
>>3000000000.0
>>
>>a.sum() blockwise upcasts to the largest type of kind on the fly, in
>>this case, Int64. This avoids the storage overhead of typecasting the
>>entire array.
> 
> 
> That's on a 64-bit platform, right? The same method could be used to
> cast the accumulator to a Float64 on a 32-bit platform to avoid casting
> the entire array?
> 
> 
>>A better name for the method would have been sumall() since it sums all
>>elements of a multi-dimensional array. The flattening process reduces
>>on one dimension before flattening preventing a full copy of a
>>discontiguous array. It could be smarter about choosing the dimension
>>of the initial reduction.
> 
> 
> OK, thanks. Unfortunately it is not possible for us to port our
> application to numarray at the moment. But the insight is most helpful.
> 
> Tim C
Tim,

Thank you very much for your answer. I posted two follow-ups to my own 
post on c.l.p 
(http://groups.google.de/group/comp.lang.python/browse_thread/thread/a96c404d73d71259/7769fca1fa562718?hl=de#7769fca1fa562718)

A least for scipy version '0.3.2' the problem comes directly from the 
ufunc 'add' for bool arrays:

In [178]:array(array([1],'b')*2)
Out[178]:array([2],'i')

In [179]:array(array([1],'b')+array([1],'b'))
Out[179]:array([2],'b')

'multiply' changes the typecode.

So, in this case

a+a != a*2 if a is an array with bytecode 'b'.

Regards
Hans Georg Krauthäuser
-- 
Otto-von-Guericke-Universitaet Magdeburg
IGET            |   fon: +49 391 67 12195
Postfach 4120   |   fax: +49 391 67 11236
39016 Magdeburg | email: hgk at et.Uni-Magdeburg.DE
Germany         |   www: www.uni-magdeburg.de/krauthae