[Numpy-discussion] sampling based on running sums

Sat Jun 28 01:40:38 EDT 2008

On Fri, Jun 27, 2008 at 15:06, John Hunter <jdh2358 at gmail.com> wrote:
> I would like to find the sample points where the running sum of some
> vector

All non-negative, right?

> exceeds some threshold -- at those points I want to collect all
> the data in the vector since the last time the criterion was reached
> and compute some stats on it.  For example, in python
>
>    tot = 0.
>    xs = []
>    ys = []
>
>    samples1 = []
>    for thisx, thisy in zip(x, y):
>        tot += thisx
>        xs.append(thisx)
>        ys.append(thisy)
>        if tot>=threshold:
>            samples1.append(func(xs,ys))
>            tot = 0.
>            xs = []
>            ys = []
>
>
> The following is close in numpy
>
>    sx = np.cumsum(x)
>    n = (sx/threshold).astype(int)
>    ind = np.nonzero(np.diff(n)>0)[0]+1
>
>    lasti = 0
>    samples2 = []
>    for i in ind:
>        xs = x[lasti:i+1]
>        ys = y[lasti:i+1]
>        samples2.append(func(xs, ys))
>        lasti = i
>
> But the sample points in ind do not guarantee that at least threshold
> points are between the sample points due to truncation error.

Truncation error?

One reason you get different results between the two is that you are
finding the locations where the sum exceeds an integer multiple of the
threshold *starting from 0*. In the pure-Python version, you reset the
count to 0 every time you hit the threshold.
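
To make the difference concrete, here is a small made-up example (the
values of x and threshold are mine, chosen only for illustration). It
compares the indices where the loop resets its total with the indices
where the cumsum crosses an integer multiple of the threshold:

```python
import numpy as np

x = np.array([4.0, 4.0, 4.0, 4.0])  # illustrative data only
threshold = 5.0

# Indices where the pure-Python loop resets its running total.
reset_ind = []
tot = 0.0
for i, xi in enumerate(x):
    tot += xi
    if tot >= threshold:
        reset_ind.append(i)
        tot = 0.0

# Indices where the cumsum crosses an integer multiple of the threshold.
sx = np.cumsum(x)
n = (sx / threshold).astype(int)
multiple_ind = (np.nonzero(np.diff(n) > 0)[0] + 1).tolist()

print(reset_ind)     # [1, 3]
print(multiple_ind)  # [1, 2, 3]
```

The cumsum version flags index 2 because the running total passes 10
there, even though only 4 has accumulated since the loop's last reset
at index 1.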

Assuming that the data are all non-negative, the cumsum is sorted. So
use searchsorted() to find the first index at which the threshold is
reached. Split the data into two parts at that index. Use the lower
part (up to and including that index) to compute your function.
Subtract the last cumsum value of the lower part from the upper part
to reset the sum. Rinse and repeat with the shifted upper part.
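
The steps above could be sketched roughly as follows. This is only one
way to write it, not tested against your data. The function name
threshold_segments is my own invention, and rather than physically
shifting the remaining cumsum each time, it keeps an offset and searches
for offset + threshold, which should amount to the same thing. It
assumes x is non-negative.

```python
import numpy as np

def threshold_segments(x, y, threshold, func):
    """Apply func to successive chunks whose x-sum reaches threshold.

    Hypothetical helper; assumes every element of x is non-negative so
    that the cumulative sum is sorted.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    sx = np.cumsum(x)
    samples = []
    start = 0      # first index of the current chunk
    offset = 0.0   # cumsum value at the end of the previous chunk
    while start < len(x):
        # First index i >= start with sx[i] >= offset + threshold.
        i = start + np.searchsorted(sx[start:], offset + threshold, side="left")
        if i >= len(x):
            break  # the remaining data never reaches the threshold
        samples.append(func(x[start:i + 1], y[start:i + 1]))
        offset = sx[i]   # "reset" the running sum
        start = i + 1
    return samples
```

Each chunk includes the element that pushes the sum to the threshold,
and the next chunk starts right after it, as in the pure-Python loop.
With floating-point data the results may differ from the loop in
borderline cases, since comparing against offset + threshold does not
round the same way as accumulating from zero.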

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco