[Python-ideas] Batching/grouping function for itertools

Nick Coghlan ncoghlan at gmail.com
Sun Dec 8 12:05:18 CET 2013


On 8 December 2013 19:25, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, Dec 08, 2013 at 05:02:05PM +1000, Nick Coghlan wrote:
>
>> The windowing problem is too ill-defined - there are enough degrees of
>> freedom that any API flexible enough to cover them all is harder to
>> learn than just building out your own version that works the way you
>> want it to, and a more restrictive API that *doesn't* cover all the
>> variants introduces a sharp discontinuity between the "blessed"
>> variant and the alternatives.
>
> Playing Devil's Advocate here, I wonder if that is true though. It seems
> to me that there are two basic windowing variants: sliding windows, and
> discrete windows. That is, given a sequence [a, b, c, d, e, f, g, h, i]
> and a window size of 3, the two obvious, common results are:
>
> # sliding window
> (a,b,c), (b,c,d), (c,d,e), (d,e,f), (e,f,g), (f,g,h), (g,h,i)
>
> # discrete windows
> (a,b,c), (d,e,f), (g,h,i)
>
>
> Surely anything else is exotic enough that there is no question about
> leaving it up to the individual programmer.
>
> In the second case, there is a question about what to do with sequences
> that are not a multiple of the window size. Similar to zip(), there are
> two things one might do:
>
> - pad with some given object;
> - raise an exception
>
> If you want to just ignore extra items, just catch the exception and
> continue. So that's a maximum of three window functions:
>
> sliding(iterable, window_size)
> discrete(iterable, window_size, pad=None)
> strict_discrete(iterable, window_size)
>
> or just two, if you combine discrete and strict_discrete:
>
> discrete(iterable, window_size [, pad])
> # raise if pad is not given
>
> What other varieties are there? Surely none that are common. Once, for a
> lark, I tried to come up with one that was fully general -- as well as a
> window size, you could specify how far to advance the window each step.
> The sliding variety would advance by 1 each step, the discrete variety
> would advance by the window size. But I never found any reason to use it
> with any other step sizes. Surely anything else is more useful in theory
> than in practice.
>
> (That's three times I've said something is "surely" true, always a sign
> my argument is weak *grin*)

I'm biased by a signal processing background where playing games with
data windows and the amount of overlap between samples is a *really*
common technique :)

> Given that this windowing problem keeps coming up, there's no doubt in
> my mind that it is a useful, if not fundamental, iterator operation.
> Ruby's Enumerable module includes both:
>
> http://ruby-doc.org/core-2.0.0/Enumerable.html
>
> each_cons is what I've been calling a sliding window, and each_slice is
> what I've been calling discrete chunks.

The two examples in the itertools docs are currently just pairwise
(sliding window of length 2) and grouper (distinct windows of
arbitrary length, always padded)

The general cases would be:

    def sliding_window(iterable, n):
        """Return a sliding window over the data, introducing one new
item into each window"""
        iterables = tee(iterable, n)
        # Prime the iterables
        for i, itr in iterables:
            for __ in range(i):
                next(itr, None)
        return zip(*iterables)

    def discrete_window(iterable, n, fillvalue=None):
        """Return distinct windows of the data, padding the last
window if necessary"""
        repeated_iterable = [iter(iterable)] * n
        return zip_longest(*repeated_iterable, fillvalue=fillvalue)

Given the padding version of discrete_window, the exception raising
version is just:

    def discrete_window_no_padding(iterable, n):
        sentinel = object()
        for x in discrete_window(iterable, n, sentinel):
            if x[-1] is sentinel: raise ValueError("Ragged final partition")
            yield x

Given the "n-1" overlapping version of sliding window, the "selective
overlap" version (ignoring any leftover data at the end) is just:

    def sliding_window_with_configurable_overlap(iterable, n, new_items=1):
        if new_items == 1:
            return sliding_window(iterable, n)
        return islice(sliding_window(iterable, n), 0, None, new_items)

The main argument in favour of offering sliding_window and
discrete_window as itertools is that they each rely on a sophisticated
trick in iterator state manipulation:

- the sliding window relies on using tee() and then priming the
results to start at the appropriate place in the initial window
- the discrete window relies on using *multiple* reference to a single
iterator and exploiting the fact that the iterator advances each time
a value is retrieved

That's a deeper understanding of the object model than most people
will have, so they're likely to just cargo cult the recipe from the
docs anyway, without really trying to understand exactly how it works.

I guess I'm +0 rather than -1 at this point, but it's really Raymond
that needs to be convinced as module maintainer (other points of note:
such a change wouldn't be possible until Python 3.5 anyway, and it
would also require restructuring itertools to be a hybrid C/Python
module, since writing these in C wouldn't offer any significant
benefits - the inner loops in the pure Python versions are already
using existing high speed iterators).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


More information about the Python-ideas mailing list