Batching/grouping function for itertools

After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.

Existing possible solutions include both the "clever" but somewhat unreadable...

    batched_iter = zip(*[iter(input_iter)]*n)

...and the long-form...

    def batch(input_iter, n):
        input_iter = iter(input_iter)
        while True:
            yield [next(input_iter) for _ in range(n)]

There doesn't seem, however, to be one clear "right" way to do this. Every time I come up against this task, I go back to itertools expecting one of the grouping functions there to cover it, but they don't.

It seems like it would be a natural fit for itertools, and it would simplify things like processing of file formats that use a consistent number of lines per entry, et cetera.

~Amber
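For concreteness: the zip trick works because all n positions share a single iterator object, so each output tuple pulls n consecutive items -- and any leftover items are silently dropped. A quick demonstration, assuming Python 3:

    >>> list(zip(*[iter(range(7))] * 3))
    [(0, 1, 2), (3, 4, 5)]
    >>> # note that the trailing 6 was dropped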

On Sat, Dec 7, 2013 at 8:44 PM, Amber Yust <amber.yust@gmail.com> wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.
+1. In my experience the grouper recipe in the docs serves less as a helpful example of how to use itertools and more as a thing to copy-paste. That's what modules are for. -- Devin

On 8 December 2013 15:14, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Sat, Dec 7, 2013 at 8:44 PM, Amber Yust <amber.yust@gmail.com> wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.
+1. In my experience the grouper recipe in the docs serves less as a helpful example of how to use itertools and more as a thing to copy-paste. That's what modules are for.
The windowing problem is too ill-defined - there are enough degrees of freedom that any API flexible enough to cover them all is harder to learn than just building out your own version that works the way you want it to, and a more restrictive API that *doesn't* cover all the variants introduces a sharp discontinuity between the "blessed" variant and the alternatives.

For anyone that thinks the stdlib itertools is too minimalist (I'm not one of them), then "pip install more-itertools" provides the recipes from the stdlib docs, as well as a few other precomposed operations.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
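For anyone following that pointer: in recent releases of more-itertools the relevant helpers are spelled chunked and windowed (this assumes a current version of the package; exact names have varied over time):

    >>> from more_itertools import chunked, windowed
    >>> list(chunked(range(7), 3))
    [[0, 1, 2], [3, 4, 5], [6]]
    >>> list(windowed(range(4), 3))
    [(0, 1, 2), (1, 2, 3)]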

Hello Amber,

These issues -- a batching function in itertools and including the itertools recipes in the stdlib -- have both been discussed here recently.

Specifically regarding the batching function, I couldn't find the most recent discussion via a quick search. IIRC the conclusion was what Nick said: different use-cases require slightly different behaviors, which cannot be elegantly expressed as a single, simple and straightforward function. Therefore, it is better to have a basic recipe in the docs, which everyone can modify according to their needs.

With regard to including the other recipes in the stdlib, I recommend reading the most recent discussion on the archives [1]. The major argument against this is that these recipes are easily implemented based on the existing tools, but having all of them in the stdlib means having to support them all in the future, including maintaining backwards compatibility. Supporting stdlib code is considerably harder than having working examples in the docs.

- Tal

[1] https://mail.python.org/pipermail/python-ideas/2012-July/015714.html

On Sun, Dec 8, 2013 at 9:02 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 8 December 2013 15:14, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Sat, Dec 7, 2013 at 8:44 PM, Amber Yust <amber.yust@gmail.com> wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.
+1. In my experience the grouper recipe in the docs serves less as a helpful example of how to use itertools and more as a thing to copy-paste. That's what modules are for.
The windowing problem is too ill-defined - there are enough degrees of freedom that any API flexible enough to cover them all is harder to learn than just building out your own version that works the way you want it to, and a more restrictive API that *doesn't* cover all the variants introduces a sharp discontinuity between the "blessed" variant and the alternatives.
For anyone that thinks the stdlib itertools is too minimalist (I'm not one of them), then "pip install more-itertools" provides the recipes from the stdlib docs, as well as a few other precomposed operations.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Dec 08, 2013 at 05:02:05PM +1000, Nick Coghlan wrote:
The windowing problem is too ill-defined - there are enough degrees of freedom that any API flexible enough to cover them all is harder to learn than just building out your own version that works the way you want it to, and a more restrictive API that *doesn't* cover all the variants introduces a sharp discontinuity between the "blessed" variant and the alternatives.
Playing Devil's Advocate here, I wonder if that is true though. It seems to me that there are two basic windowing variants: sliding windows, and discrete windows. That is, given a sequence [a, b, c, d, e, f, g, h, i] and a window size of 3, the two obvious, common results are:

    # sliding window
    (a,b,c), (b,c,d), (c,d,e), (d,e,f), (e,f,g), (f,g,h), (g,h,i)

    # discrete windows
    (a,b,c), (d,e,f), (g,h,i)

Surely anything else is exotic enough that there is no question about leaving it up to the individual programmer.

In the second case, there is a question about what to do with sequences that are not a multiple of the window size. Similar to zip(), there are two things one might do:

- pad with some given object;
- raise an exception

If you want to just ignore extra items, just catch the exception and continue. So that's a maximum of three window functions:

    sliding(iterable, window_size)
    discrete(iterable, window_size, pad=None)
    strict_discrete(iterable, window_size)

or just two, if you combine discrete and strict_discrete:

    discrete(iterable, window_size [, pad])  # raise if pad is not given

What other varieties are there? Surely none that are common. Once, for a lark, I tried to come up with one that was fully general -- as well as a window size, you could specify how far to advance the window each step. The sliding variety would advance by 1 each step, the discrete variety would advance by the window size. But I never found any reason to use it with any other step sizes. Surely anything else is more useful in theory than in practice.

(That's three times I've said something is "surely" true, always a sign my argument is weak *grin*)

Given that this windowing problem keeps coming up, there's no doubt in my mind that it is a useful, if not fundamental, iterator operation. Ruby's Enumerable module includes both:

http://ruby-doc.org/core-2.0.0/Enumerable.html

each_cons is what I've been calling a sliding window, and each_slice is what I've been calling discrete chunks.

-- Steven
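For what it's worth, a minimal sketch of what those signatures might look like -- the names, and the sentinel-based handling of the optional pad argument, are only illustrative:

    from itertools import islice

    _MISSING = object()  # sentinel standing in for "pad not given"

    def sliding(iterable, window_size):
        it = iter(iterable)
        window = tuple(islice(it, window_size))
        if len(window) == window_size:
            yield window
        for item in it:
            # drop the oldest item, append the newest
            window = window[1:] + (item,)
            yield window

    def discrete(iterable, window_size, pad=_MISSING):
        it = iter(iterable)
        while True:
            chunk = tuple(islice(it, window_size))
            if not chunk:
                return
            if len(chunk) < window_size:
                if pad is _MISSING:
                    raise ValueError("length not a multiple of window_size")
                chunk += (pad,) * (window_size - len(chunk))
            yield chunk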

On 8 December 2013 19:25, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Dec 08, 2013 at 05:02:05PM +1000, Nick Coghlan wrote:
The windowing problem is too ill-defined - there are enough degrees of freedom that any API flexible enough to cover them all is harder to learn than just building out your own version that works the way you want it to, and a more restrictive API that *doesn't* cover all the variants introduces a sharp discontinuity between the "blessed" variant and the alternatives.
Playing Devil's Advocate here, I wonder if that is true though. It seems to me that there are two basic windowing variants: sliding windows, and discrete windows. That is, given a sequence [a, b, c, d, e, f, g, h, i] and a window size of 3, the two obvious, common results are:
# sliding window
(a,b,c), (b,c,d), (c,d,e), (d,e,f), (e,f,g), (f,g,h), (g,h,i)

# discrete windows
(a,b,c), (d,e,f), (g,h,i)
Surely anything else is exotic enough that there is no question about leaving it up to the individual programmer.
In the second case, there is a question about what to do with sequences that are not a multiple of the window size. Similar to zip(), there are two things one might do:
- pad with some given object;
- raise an exception
If you want to just ignore extra items, just catch the exception and continue. So that's a maximum of three window functions:
sliding(iterable, window_size)
discrete(iterable, window_size, pad=None)
strict_discrete(iterable, window_size)
or just two, if you combine discrete and strict_discrete:
discrete(iterable, window_size [, pad]) # raise if pad is not given
What other varieties are there? Surely none that are common. Once, for a lark, I tried to come up with one that was fully general -- as well as a window size, you could specify how far to advance the window each step. The sliding variety would advance by 1 each step, the discrete variety would advance by the window size. But I never found any reason to use it with any other step sizes. Surely anything else is more useful in theory than in practice.
(That's three times I've said something is "surely" true, always a sign my argument is weak *grin*)
I'm biased by a signal processing background where playing games with data windows and the amount of overlap between samples is a *really* common technique :)
Given that this windowing problem keeps coming up, there's no doubt in my mind that it is a useful, if not fundamental, iterator operation. Ruby's Enumerable module includes both:
http://ruby-doc.org/core-2.0.0/Enumerable.html
each_cons is what I've been calling a sliding window, and each_slice is what I've been calling discrete chunks.
The two examples in the itertools docs are currently just pairwise (sliding window of length 2) and grouper (distinct windows of arbitrary length, always padded).

The general cases would be:

    def sliding_window(iterable, n):
        """Return a sliding window over the data, introducing one new item into each window"""
        iterables = tee(iterable, n)
        # Prime the iterables
        for i, itr in enumerate(iterables):
            for __ in range(i):
                next(itr, None)
        return zip(*iterables)

    def discrete_window(iterable, n, fillvalue=None):
        """Return distinct windows of the data, padding the last window if necessary"""
        repeated_iterable = [iter(iterable)] * n
        return zip_longest(*repeated_iterable, fillvalue=fillvalue)

Given the padding version of discrete_window, the exception raising version is just:

    def discrete_window_no_padding(iterable, n):
        sentinel = object()
        for x in discrete_window(iterable, n, sentinel):
            if x[-1] is sentinel:
                raise ValueError("Ragged final partition")
            yield x

Given the "n-1" overlapping version of sliding window, the "selective overlap" version (ignoring any leftover data at the end) is just:

    def sliding_window_with_configurable_overlap(iterable, n, new_items=1):
        if new_items == 1:
            return sliding_window(iterable, n)
        return islice(sliding_window(iterable, n), 0, None, new_items)

The main argument in favour of offering sliding_window and discrete_window as itertools is that they each rely on a sophisticated trick in iterator state manipulation:

- the sliding window relies on using tee() and then priming the results to start at the appropriate place in the initial window
- the discrete window relies on using *multiple* references to a single iterator and exploiting the fact that the iterator advances each time a value is retrieved

That's a deeper understanding of the object model than most people will have, so they're likely to just cargo cult the recipe from the docs anyway, without really trying to understand exactly how it works.

I guess I'm +0 rather than -1 at this point, but it's really Raymond that needs to be convinced as module maintainer (other points of note: such a change wouldn't be possible until Python 3.5 anyway, and it would also require restructuring itertools to be a hybrid C/Python module, since writing these in C wouldn't offer any significant benefits - the inner loops in the pure Python versions are already using existing high speed iterators).

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
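To make the behaviour concrete, here is roughly what those two produce, assuming "from itertools import tee, islice, zip_longest" alongside the definitions above:

    >>> list(sliding_window("abcde", 3))
    [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
    >>> list(discrete_window("abcde", 3))
    [('a', 'b', 'c'), ('d', 'e', None)]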

On 12/08/2013 05:05 AM, Nick Coghlan wrote:
The main argument in favour of offering sliding_window and discrete_window as itertools is that they each rely on a sophisticated trick in iterator state manipulation:
How about 2 lower level building blocks that would make these things easier to make and think about?

Possibly a function to take the next n items of an iterator without advancing it. Along with a function to advance an iterator n ahead without taking anything.

These would be simpler and easier to maintain, and have a wider range of uses.

Cheers, Ron

On Mon, Dec 9, 2013 at 2:13 AM, Ron Adam <ron3200@gmail.com> wrote:
How about 2 lower level building blocks that would make these things easier to make and think about.
Possibly a function to take the next n items of an iterator without advancing it.
Fundamentally impossible. Here's a function that returns an iterator:

    def d20():
        import random
        while True:
            yield random.randrange(1, 21)

    dice_roller = d20()

How are you going to take the next n items from dice_roller without advancing it?

ChrisA

On Mon, Dec 9, 2013 at 2:34 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Chris Angelico writes:
How are you going to take the next n items from dice_roller without advancing it?
Memoize.
That's not a building-block then, that's a quite separate feature. In this particular instance there's no way to distinguish between "predict the next three but don't advance the iterator" and "advance the iterator by three and then rewind it the same distance", but imagine an iterator that blocks for input, or something. You don't want a purportedly low-level function (from which you derive the more "usable" functions) doing memoization on that.

ChrisA

On Mon, Dec 09, 2013 at 12:34:28AM +0900, Stephen J. Turnbull wrote:
Chris Angelico writes:
How are you going to take the next n items from dice_roller without advancing it?
Memoize.
Er, I don't think so. How does the memoizing cache get those values if the underlying iterator isn't advanced? Obviously it can't. itertools.tee uses a cache, so we can demonstrate the issue:

    py> it = iter("abcde")
    py> wrapper = itertools.tee(it, 2)[0]
    py> _ = list(wrapper)

If the iterator hasn't advanced, then next(it) should yield 'a'. But:

    py> next(it)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration

Any sort of "iterator look-ahead" has a number of fundamental problems. Despite many requests, those problems are part of the reason why Python iterators don't provide a "peek" method to look ahead. Not even to look ahead a single value, let alone an arbitrary number of values.

- The cache would require unbounded memory (unless you limit the look-ahead to N values);
- iterators with side-effects would cause those side-effects at the wrong time;
- iterators whose calculated values are time-dependent could be calculated at a different time from when they are returned, potentially giving the wrong result.

For something like tee, it is difficult to see any other way other than memoisation to get the functionality needed, so we just have to live with the limitations. But offering dedicated look-ahead with caching as fundamental iterator tools, as Ron suggests, strikes me as completely the wrong thing to do if what we actually want is to group items. It doesn't solve the problem being asked, since it's still up to the caller to make their own grouper tool out of the memoising primitive.

-- Steven

On 12/08/2013 09:22 AM, Chris Angelico wrote:
How about 2 lower level building blocks that would make these things easier to make and think about? Possibly a function to take the next n items of an iterator without advancing it.

Fundamentally impossible.
In some cases yes, but you would know if it could work before you chose to use these.
Here's a function that returns an iterator:
    def d20():
        import random
        while True:
            yield random.randrange(1, 21)

    dice_roller = d20()
How are you going to take the next n items from dice_roller without advancing it?
If it is to work with generators too... The function would need to be a wrapper that keeps a buffer. And the front of the buffer would always be the next to be yielded if there is anything in it.

In the case of advancing a generator without taking the values, it would still need to call the __next__ method on it, but not actually return the values.

Yes, it won't be as simple as it sounds. ;-)

cheers, Ron

On Mon, Dec 9, 2013 at 2:40 AM, Ron Adam <ron3200@gmail.com> wrote:
The function would need to be a wrapper that keeps a buffer. And the front of the buffer would always be the next to be yielded if there is anything in it.
Which is what Stephen said in his pithy message above. Yes, it's theoretically possible, but it's not something to do in the general case. Not something to depend on for a chunker/grouper, which should be able to pass once over the underlying iterator. ChrisA

On 12/08/2013 09:46 AM, Chris Angelico wrote:
On Mon, Dec 9, 2013 at 2:40 AM, Ron Adam <ron3200@gmail.com> wrote:
The function would need to be a wrapper that keeps a buffer. And the front of the buffer would always be the next to be yielded if there is anything in it.

Which is what Stephen said in his pithy message above. Yes, it's theoretically possible, but it's not something to do in the general case. Not something to depend on for a chunker/grouper, which should be able to pass once over the underlying iterator.
For the windowed chunker that was being discussed, and uses that are similar, they need to hold references to some yielded items someplace. This just packages that need in a convenient way.

It almost seems to me that coroutines, which is the counter example you suggested, should maybe be a type of their own. That way, they can be tested. (And raise an error if used in cases like this.) But that's a whole other topic. Best answer for now is don't do that.

Cheers, Ron

On 9 Dec 2013 01:47, "Chris Angelico" <rosuav@gmail.com> wrote:
On Mon, Dec 9, 2013 at 2:40 AM, Ron Adam <ron3200@gmail.com> wrote:
The function would need to be a wrapper that keeps a buffer. And the front of the buffer would always be the next to be yielded if there is anything in it.
Which is what Stephen said in his pithy message above. Yes, it's theoretically possible, but it's not something to do in the general case. Not something to depend on for a chunker/grouper, which should be able to pass once over the underlying iterator.
tee() is this building block. Cheers, Nick.
ChrisA
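To illustrate that point, a sketch of look-ahead built on tee() -- the caveat being that the caller must abandon the original reference and use the replacement iterator that tee() hands back:

    >>> from itertools import tee, islice
    >>> it = iter(range(10))
    >>> it, peeker = tee(it)     # rebind: never touch the old reference again
    >>> list(islice(peeker, 3))  # "peek" at the next three items
    [0, 1, 2]
    >>> next(it)                 # the main stream has not been advanced
    0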

On Sun, Dec 08, 2013 at 09:13:06AM -0600, Ron Adam wrote:
Possibly a function to take the next n items of an iterator without advancing it.
Fundamentally impossible. The best you can do is advance the iterator but store the results for later use.
Along with a function to advance an iterator n ahead without taking anything.
Too trivial to bother with. Just advance the iterator and throw the result away.

    def advance(it, n):
        for _ in range(n):
            next(it)

You can even do it as a one-liner, at the expense of readability:

    {next(it) and None for _ in range(n)}.pop()

(The pop isn't really necessary, I just like the fact that it means the expression evaluates as None.)
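For comparison, the consume() recipe in the itertools docs does the same job with a deque (shown here essentially as the docs give it):

    from collections import deque
    from itertools import islice

    def consume(iterator, n=None):
        "Advance the iterator n steps ahead. If n is None, consume entirely."
        if n is None:
            # feed the entire iterator into a zero-length deque
            deque(iterator, maxlen=0)
        else:
            # advance to the empty slice starting at position n
            next(islice(iterator, n, n), None)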
These would be simpler and easier to maintain, and have a wider range of uses.
Maybe, maybe not, but they don't solve the problem that people keep asking to be solved. -- Steven

On 08.12.13 11:25, Steven D'Aprano wrote:
In the second case, there is a question about what to do with sequences that are not a multiple of the window size. Similar to zip(), there are two things one might do:
- pad with some given object;
- raise an exception
3) emit the last chunk incomplete;
4) skip the incomplete chunk.

There is also a question about the result's type. Sometimes you need an iterator of subsequences (e.g. splitting a string into equal string chunks), sometimes an iterator of iterators is enough. That's four ways of handling the tail times two result types, i.e. at least 8 variants are needed.

On Sun, Dec 08, 2013 at 01:30:56PM +0200, Serhiy Storchaka wrote:
On 08.12.13 11:25, Steven D'Aprano wrote:
In the second case, there is a question about what to do with sequences that are not a multiple of the window size. Similar to zip(), there are two things one might do:
- pad with some given object;
- raise an exception
3) emit the last chunk incomplete;
Given a window size of two, and input data [a, b, c], are you suggesting a variety that returns this?

    (a,b), (c,)

There is no need for a separate function for that. Given a version that takes a pad object, if the pad argument is not given, return a partial chunk at the end.
4) skip the incomplete chunk.
The very next sentence in my post references that: "If you want to just ignore extra items, just catch the exception and continue." There is no need for an extra function covering that case.
There is also a question about the result's type. Sometimes you need an iterator of subsequences (e.g. splitting a string into equal string chunks), sometimes an iterator of iterators is enough.
None of the other itertools functions treat strings specially. Why should this one? If you want to re-join them into strings, you can do so with a trivial wrapper:

    (''.join(elements) for elements in group("some string", 3, pad=' '))

ought to do the trick, assuming group returns tuples or lists of characters. Re-combining the iterated-upon elements into the input type is not the responsibility of itertools.

-- Steven

On 08.12.13 14:16, Steven D'Aprano wrote:
On Sun, Dec 08, 2013 at 01:30:56PM +0200, Serhiy Storchaka wrote:
On 08.12.13 11:25, Steven D'Aprano wrote:
In the second case, there is a question about what to do with sequences that are not a multiple of the window size. Similar to zip(), there are two things one might do:
- pad with some given object;
- raise an exception
3) emit the last chunk incomplete;
Given a window size of two, and input data [a, b, c], are you suggesting a variety that returns this?
(a,b), (c,)
There is no need for a separate function for that. Given a version that takes a pad object, if the pad argument is not given, return a partial chunk at the end.
You had proposed raising an exception when the pad argument is not given in your previous message. In any case these are three different cases, and you can combine only two of them in one function using the "absent argument" trick.
4) skip the incomplete chunk.
The very next sentence in my post references that:
"If you want to just ignore extra items, just catch the exception and continue."
There is no need for an extra function covering that case.
You can't just use this generator in an expression (e.g. as an argument to list()). You need a special wrapper which catches the exception. This is the fourth variant. And if you need just this variant, why is it not in the stdlib?
There is also a question about the result's type. Sometimes you need an iterator of subsequences (e.g. splitting a string into equal string chunks), sometimes an iterator of iterators is enough.
None of the other itertools functions treat strings specially. Why should this one?
Because I relatively often need this idiom and almost never need a general function for iterators. I'm sure a function which splits sequences is enough in at least 90% of the cases where a grouping function is needed.
If you want to re-join them into strings, you can do so with a trivial wrapper:
(''.join(elements) for elements in group("some string", 3, pad=' '))
ought to do the trick, assuming group returns tuples or lists of characters.
This is too slow and verbose, and kills the benefits of a grouping function.

On 12/8/2013 7:16 AM, Steven D'Aprano wrote:
On Sun, Dec 08, 2013 at 01:30:56PM +0200, Serhiy Storchaka wrote:
There is also a question about the result's type. Sometimes you need an iterator of subsequences (e.g. splitting a string into equal string chunks), sometimes an iterator of iterators is enough.
None of the other itertools functions treat strings specially. Why should this one? If you want to re-join them into strings, you can do so with a trivial wrapper:
(''.join(elements) for elements in group("some string", 3, pad=' '))
A large fraction, perhaps over half, of the multiple requests for a chunker or grouper function are for sequences, not general iterables, as input, with the desired output type being the input type. For this, an iterator of *slices* is *far* more efficient. The same function could easily handle overlaps. (There are still the possible varieties of short slice handling.) *Untested*:

    def window(seq, size, advance=None, extra='skip'):
        '''Yield successive slices of len size of sequence seq.

        Move window advance items (default = size).
        Extra determines the handling of extra items.
        The options are 'skip' (default), 'keep', and 'raise'.
        '''
        if advance is None:
            advance = size
        i, j, n = 0, size, len(seq)
        while j <= n:
            yield seq[i:j]
            i += advance
            j += advance
        if j < n + advance:
            if extra == 'keep':
                yield seq[i:j]
            elif extra == 'raise':
                raise ValueError('extra items')
            elif extra != 'skip':
                raise ValueError('bad extra')

Having gotten this far, it would be possible to treat the above as a fast path for sequences and wrap it in try:except and if len or slice fail, fall back to a general iterator version. The result could be a builtin rather than an itertool.

-- Terry Jan Reedy
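A quick illustration of the intended behaviour, assuming the sketch above with the syntax fixed as shown:

    >>> list(window('abcdefg', 3))
    ['abc', 'def']
    >>> list(window('abcdefg', 3, extra='keep'))
    ['abc', 'def', 'g']
    >>> list(window('abcdefg', 3, advance=1))
    ['abc', 'bcd', 'cde', 'def', 'efg']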

On 8 December 2013 12:16, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Dec 08, 2013 at 01:30:56PM +0200, Serhiy Storchaka wrote:
08.12.13 11:25, Steven D'Aprano написав(ла):
In the second case, there is a question about what to do with sequences that are not a multiple of the window size. Similar to zip(), there are two things one might do:
- pad with some given object;
- raise an exception
3) emit the last chunk incomplete;
Given a window size of two, and input data [a, b, c], are you suggesting a variety that returns this?
(a,b), (c,)
This is the only variant I have ever used. You can see an example use case here:
https://mail.python.org/pipermail//python-ideas/2013-August/022767.html

And this is the implementation I use:

    from itertools import islice, repeat, takewhile

    def chunks(iterable, chunksize=100):
        islices = map(islice, repeat(iter(iterable)), repeat(chunksize))
        return takewhile(bool, map(list, islices))

Oscar
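For concreteness, with the imports above this yields the incomplete final chunk rather than padding or raising:

    >>> list(chunks(range(7), 3))
    [[0, 1, 2], [3, 4, 5], [6]]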

Steven D'Aprano writes:
What other varieties are there? Surely none that are common. Once, for a lark, I tried to come up with one that was fully general -- as well as a window size, you could specify how far to advance the window each step. The sliding variety would advance by 1 each step, the discrete variety would advance by the window size. But I never found any reason to use it with any other step sizes. Surely anything else is more useful in theory than in practice.
Deseasonalization of serially correlated data where the seasonality is lower-frequency than the series, and more generally data-mining techniques that start with relatively coarse steps and refine as they go along are two that come immediately to mind.

On 12/08/2013 05:44 AM, Amber Yust wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.
Existing possible solutions include both the "clever" but somewhat unreadable...
batched_iter = zip(*[iter(input_iter)]*n)
...and the long-form...
    def batch(input_iter, n):
        input_iter = iter(input_iter)
        while True:
            yield [next(input_iter) for _ in range(n)]
This function drops items if the length of the input sequence is not a multiple of n. Fix:

    def batch(it, n):
        it = iter(it)
        while True:
            slice = []
            for _ in range(n):
                try:
                    slice.append(next(it))
                except StopIteration:
                    if slice:
                        yield slice
                    return
            yield slice
There doesn't seem, however, to be one clear "right" way to do this. Every time I come up against this task, I go back to itertools expecting one of the grouping functions there to cover it, but they don't.
It seems like it would be a natural fit for itertools, and it would simplify things like processing of file formats that use a consistent number of lines per entry, et cetera.
~Amber

So does zip if the items are of unequal length, and the two code examples I provided (the one using zip and the long-form one) are equivalent.

On Sun Dec 08 2013 at 9:49:56 AM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
On 12/08/2013 05:44 AM, Amber Yust wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.
Existing possible solutions include both the "clever" but somewhat unreadable...
batched_iter = zip(*[iter(input_iter)]*n)
...and the long-form...
    def batch(input_iter, n):
        input_iter = iter(input_iter)
        while True:
            yield [next(input_iter) for _ in range(n)]
This function drops items if the length of the input sequence is not a multiple of n. Fix:
    def batch(it, n):
        it = iter(it)
        while True:
            slice = []
            for _ in range(n):
                try:
                    slice.append(next(it))
                except StopIteration:
                    if slice:
                        yield slice
                    return
            yield slice
There doesn't seem, however, to be one clear "right" way to do this. Every time I come up against this task, I go back to itertools expecting one of the grouping functions there to cover it, but they don't.
It seems like it would be a natural fit for itertools, and it would simplify things like processing of file formats that use a consistent number of lines per entry, et cetera.
~Amber

I see. Well, I wouldn't expect that behaviour from such a function.

On 12/08/2013 07:06 PM, Amber Yust wrote:
So does zip if the items are of unequal length, and the two code examples I provided (the one using zip and the long-form one) are equivalent.
On Sun Dec 08 2013 at 9:49:56 AM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
On 12/08/2013 05:44 AM, Amber Yust wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.

Existing possible solutions include both the "clever" but somewhat unreadable...

    batched_iter = zip(*[iter(input_iter)]*n)

...and the long-form...

    def batch(input_iter, n):
        input_iter = iter(input_iter)
        while True:
            yield [next(input_iter) for _ in range(n)]
This function drops items if the length of the input sequence is not a multiple of n. Fix:
    def batch(it, n):
        it = iter(it)
        while True:
            slice = []
            for _ in range(n):
                try:
                    slice.append(next(it))
                except StopIteration:
                    if slice:
                        yield slice
                    return
            yield slice
There doesn't seem, however, to be one clear "right" way to do this. Every time I come up against this task, I go back to itertools expecting one of the grouping functions there to cover it, but they don't.

It seems like it would be a natural fit for itertools, and it would simplify things like processing of file formats that use a consistent number of lines per entry, et cetera.

~Amber

On 08/12/2013 04:44, Amber Yust wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.
As discussed umpteen times previously, there is no way that we can agree on "a function" that can meet all of the variations that have been proposed on this theme.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence

Hi

On Sun, Dec 8, 2013 at 8:24 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
On 08/12/2013 04:44, Amber Yust wrote:
After seeing yet another person asking how to do this on #python (and having needed to do it in the past myself), I'm wondering why itertools doesn't have a function to break an iterator up into N-sized chunks.
As discussed umpteen times previously, there is no way that we can agree on "a function" that can meet all of the variations that have been proposed on this theme.
Maybe if we add this function:
    def mod_pad(it, modulo, fillval):
        '"".join(mod_pad("hello", 3, fillval="!")) => hello!'
        i = -1
        for i, val in enumerate(it):
            yield val
        # pad out to the next multiple of modulo
        for _ in range(-(i + 1) % modulo):
            yield fillval
"".join(mod_pad("hello", 3, fillval="!")) 'hello!'
Then we can make grouper/batcher throw an exception in the case of iter_len % modulo != 0:

    grouper(mod_pad("hello", 3, fillval="!"), 3) => hel, lo! (in an iterator...)
    grouper('hello', 3) => BOOM
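A sketch of how that could fit together, borrowing Nick's discrete_window_no_padding from earlier in the thread as the strict grouper (the combination is illustrative only):

    >>> list(discrete_window_no_padding(mod_pad("hello", 3, fillval="!"), 3))
    [('h', 'e', 'l'), ('l', 'o', '!')]
    >>> list(discrete_window_no_padding("hello", 3))
    Traceback (most recent call last):
      ...
    ValueError: Ragged final partition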
participants (14)

- Amber Yust
- Chris Angelico
- Devin Jeanpierre
- Mark Lawrence
- Mathias Panzenböck
- Nick Coghlan
- Oscar Benjamin
- Ron Adam
- Serhiy Storchaka
- Stephen J. Turnbull
- Steven D'Aprano
- Tal Einat
- Terry Reedy
- yoav glazner