Looking for a "batch" function

Hi all,

An operation I often want in my Python code is some kind of simple batch function. The batch function would take an iterator and return same-size batches from it, except the last batch, which could be smaller. Two parameters would be required: the iterator and the size of each batch.

Here are some examples of what I would expect this batch function to do.

Get batches from a list:

    list(batch([1,2,3,4,5], 2))
    [[1, 2], [3, 4], [5]]

Get batches from a string:

    list(batch('one two six ten', 4))
    ['one ', 'two ', 'six ', 'ten']

Organize a stream of objects into a table:

    list(batch(['Somewhere', 'CA', 90210, 'New York', 'NY', 10001], 3))
    [['Somewhere', 'CA', 90210], ['New York', 'NY', 10001]]

My intuition tells me that such a function should exist in Python, but I have not found it in the builtin functions, slice operators, or itertools. Did I miss it?

Here is an implementation that satisfies all of the above examples, but requires a sliceable sequence as input, not just an iterator:

    def batch(input, batch_size):
        while input:
            yield input[:batch_size]
            input = input[batch_size:]

Obviously, I can just include that function in my projects, but I wonder if there is some built-in version of it. If there isn't, maybe there should be.

Shane
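The slicing implementation in the message above only works on sequences. As a minimal sketch of how the same behaviour might be obtained for arbitrary iterators, the following uses itertools.islice. The name batch_iter is hypothetical and not from the thread, and the sketch yields lists regardless of the input type, so a string comes back as lists of characters rather than substrings.

```python
from itertools import islice


def batch_iter(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice pulls at most batch_size items without needing len() or slicing.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    print(list(batch_iter(iter([1, 2, 3, 4, 5]), 2)))
```

Readers on Python 3.12 or later may also want to look at itertools.batched, which did not exist when this thread was written and which yields tuples rather than preserving the container type.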

On Sat, Jul 17, 2010 at 1:36 AM, Shane Hathaway <shane@hathawaymix.org> wrote:
> Hi all,
>
> An operation I often want in my Python code is some kind of simple batch function. The batch function would take an iterator and return same-size batches from it, except the last batch, which could be smaller. Two parameters would be required: the iterator and the size of each batch. Here are some examples of what I would expect this batch function to do.
>
> Get batches from a list:
>
>     list(batch([1,2,3,4,5], 2))
>     [[1, 2], [3, 4], [5]]
>
> Get batches from a string:
>
>     list(batch('one two six ten', 4))
>     ['one ', 'two ', 'six ', 'ten']
>
> Organize a stream of objects into a table:
>
>     list(batch(['Somewhere', 'CA', 90210, 'New York', 'NY', 10001], 3))
>     [['Somewhere', 'CA', 90210], ['New York', 'NY', 10001]]
>
> My intuition tells me that such a function should exist in Python, but I have not found it in the builtin functions, slice operators, or itertools. Did I miss it?

IMO, yes.

> Obviously, I can just include that function in my projects, but I wonder if there is some built-in version of it. If there isn't, maybe there should be.

See the "grouper" recipe in itertools:
http://docs.python.org/library/itertools.html#recipes

It does almost exactly what you want:

    grouper(3, 'ABCDEFG', 'x') --> ['A','B','C'], ['D','E','F'], ['G','x','x']

Cheers,
Chris
--
http://blog.rebertia.com
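For reference, here is a sketch of the grouper recipe mentioned above, written for Python 3. It is adapted from the itertools recipes as they stood around the time of this thread (argument order n, iterable, fillvalue, with izip_longest replaced by zip_longest), so treat it as an approximation rather than the currently documented recipe. It yields tuples and pads the final group, which is where it differs from the requested batch().

```python
from itertools import zip_longest


def grouper(n, iterable, fillvalue=None):
    """Collect items into fixed-length tuples, padding the last one.

    Adapted from the itertools recipes of the period; argument order
    follows the call shown in the thread.
    """
    # n references to the SAME iterator, so zip_longest advances it n items per tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


if __name__ == "__main__":
    print(list(grouper(3, "ABCDEFG", "x")))
```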

On 07/17/2010 02:52 AM, Chris Rebert wrote:
> See the "grouper" recipe in itertools:
> http://docs.python.org/library/itertools.html#recipes
>
> It does almost exactly what you want:
>
>     grouper(3, 'ABCDEFG', 'x') --> ['A','B','C'], ['D','E','F'], ['G','x','x']

Interesting, but I have a few concerns with that answer:

- It ignores the type of the container. If I provide a string as input, I expect an iterable of strings as output.
- If I give a batch size of 1000000, grouper() is going to be rather inefficient. Even worse would be to allow users to specify the batch size.
- Since grouper() is not actually in the standard library and it doesn't do quite what I need, it's rather unlikely that I'll use it.

Another possible name for this functionality I am describing is packetize(). Computers always packetize data for transmission, storage, and display to users. Packetizing seems like such a common need that I think it should be built in to Python.

Shane

On Sat, Jul 17, 2010 at 9:50 PM, Shane Hathaway wrote:
> On 07/17/2010 02:52 AM, Chris Rebert wrote:
>> See the "grouper" recipe in itertools:
>> http://docs.python.org/library/itertools.html#recipes
>>
>> It does almost exactly what you want:
>>
>>     grouper(3, 'ABCDEFG', 'x') --> ['A','B','C'], ['D','E','F'], ['G','x','x']
>
> Interesting, but I have a few concerns with that answer:
>
> - It ignores the type of the container. If I provide a string as input, I expect an iterable of strings as output.
> - If I give a batch size of 1000000, grouper() is going to be rather inefficient. Even worse would be to allow users to specify the batch size.
> - Since grouper() is not actually in the standard library and it doesn't do quite what I need, it's rather unlikely that I'll use it.
>
> Another possible name for this functionality I am describing is packetize(). Computers always packetize data for transmission, storage, and display to users. Packetizing seems like such a common need that I think it should be built in to Python.

This reminds me of discussions about a "flatten" function. This kind of operation often has slightly different requirements in different scenarios. It is very simple to implement a version of this to meet your exact needs. Sometimes in these kinds of situations it is better not to have a built-in generic function, to force programmers to decide explicitly how they want it to work.

You mentioned efficiency; to do this kind of operation efficiently one really needs to know what kind of sequence/iterator is being "packetized", and implement accordingly.

- Tal Einat

On Sun, Jul 18, 2010 at 6:30 AM, Tal Einat <taleinat@gmail.com> wrote:
> This kind of operation often has slightly different requirements in different scenarios. It is very simple to implement a version of this to meet your exact needs. Sometimes in these kinds of situations it is better not to have a built-in generic function, to force programmers to decide explicitly how they want it to work.
>
> You mentioned efficiency; to do this kind of operation efficiently one really needs to know what kind of sequence/iterator is being "packetized", and implement accordingly.

Indeed. There's actually a reasonably decent general windowing recipe on ASPN (http://code.activestate.com/recipes/577196-windowing-an-iterable-with-iterto...), but even that isn't appropriate for every use case. The OP, for example, has rather different requirements to what is implemented there:

- non-overlapping windows, so tee() isn't needed
- return type should match original container type

A custom generator for that task is actually pretty trivial (note: untested, so may contain typos):

    def windowed(seq, window_len):
        for slice_start in range(0, len(seq), window_len):  # use xrange() in 2.x
            slice_end = slice_start + window_len
            yield seq[slice_start:slice_end]

Even adding support for overlapped windows is fairly easy:

    def windowed(seq, window_len, overlap=0):
        slice_step = window_len - overlap
        for slice_start in range(0, len(seq), slice_step):  # use xrange() in 2.x
            slice_end = slice_start + window_len
            yield seq[slice_start:slice_end]

However, those approaches don't support arbitrary iterators (i.e. those without __len__); they only support sequences. To support arbitrary iterators, you'd need to do something fancier with collections.deque (either directly or via itertools.tee), but again, the most appropriate approach is going to be application specific (for byte data, you're probably going to want to use buffer or memoryview rather than the original container type).

It isn't that this is an uncommon problem - it's that any appropriately general solution is going to be suboptimal in many specific applications, while an optimal solution for specific applications is going to be insufficiently general to be appropriate for the standard library.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
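As a rough illustration of the "something fancier with collections.deque" that the message above alludes to, the sketch below windows an arbitrary iterator without needing len() or slicing. The name windowed_iter is hypothetical, and this is only one possible approach, not the ASPN recipe and not code from the thread. It is written to yield the same windows as the overlap-aware windowed() above (including short trailing windows), but as tuples rather than in the original container type.

```python
from collections import deque
from itertools import islice


def windowed_iter(iterable, window_len, overlap=0):
    """Yield tuples of up to window_len items from any iterable.

    Consecutive windows share `overlap` items. Hypothetical helper,
    shown only as a sketch of the deque-based approach.
    """
    if not 0 <= overlap < window_len:
        raise ValueError("overlap must satisfy 0 <= overlap < window_len")
    step = window_len - overlap
    iterator = iter(iterable)
    window = deque(islice(iterator, window_len))
    while window:
        yield tuple(window)
        # Discard the items the next window no longer covers...
        for _ in range(min(step, len(window))):
            window.popleft()
        # ...then top the window back up from the underlying iterator.
        window.extend(islice(iterator, step))


if __name__ == "__main__":
    print(list(windowed_iter(iter(range(5)), 3, overlap=1)))
```

Only window_len items are held in memory at a time, which is the main reason to prefer this over materialising the iterator into a list first.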

On 07/17/2010 07:02 PM, Nick Coghlan wrote:
> It isn't that this is an uncommon problem - it's that any appropriately general solution is going to be suboptimal in many specific applications, while an optimal solution for specific applications is going to be insufficiently general to be appropriate for the standard library.

Well, I feel like there is in fact a general solution that would be near optimal for many applications, but I would rather spend time refining the idea in real projects than get much into theory at the moment. Thanks for the feedback.

Shane

Shane Hathaway wrote:
> - It ignores the type of the container. If I provide a string as input, I expect an iterable of strings as output.

Fine, but...

> - If I give a batch size of 1000000, grouper() is going to be rather inefficient.

I guess you would prefer each batch to be a lazy iterator over part of the original sequence -- but that would conflict with the previous requirement.

--
Greg
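The tension described in the message above can be made concrete with a small sketch. Here each batch is a lazy iterator over part of the source, so a huge batch size costs nothing up front, but the batches are no longer strings (or lists) and the caller has to rebuild the container type. The name lazy_batches is hypothetical, and the sketch assumes each batch is fully consumed before the next one is requested; otherwise the batch boundaries drift.

```python
from itertools import chain, islice


def lazy_batches(iterable, batch_size):
    """Yield each batch as a lazy iterator over part of the source.

    Hypothetical helper. Each batch must be exhausted before advancing
    to the next one, because all batches share one underlying iterator.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    for first in iterator:
        # Re-attach the item consumed by the for loop, then lazily take the rest.
        yield chain([first], islice(iterator, batch_size - 1))


if __name__ == "__main__":
    # The caller must rebuild the string type explicitly.
    print(["".join(b) for b in lazy_batches("one two six ten", 4)])
```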
participants (5)
- Chris Rebert
- Greg Ewing
- Nick Coghlan
- Shane Hathaway
- Tal Einat