[Python-ideas] itertools.chunks()
Andrew Barnert
abarnert at yahoo.com
Sun Apr 7 11:51:17 CEST 2013
> From: Mathias Panzenböck <grosser.meister.morti at gmx.net>
> Sent: Saturday, April 6, 2013 9:19 PM
>
> Also I find myself often writing helper functions like these:
>
> def chunked(sequence, size):
>     i = 0
>     while True:
>         j = i
>         i += size
>         chunk = sequence[j:i]
>         if not chunk:
>             return
>         yield chunk
The grouper function in the itertools recipes does the same thing, except that it works for any iterable, not just sequences (and it fills out the last group with an optional fillvalue).
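For reference, that recipe is only a few lines (this is the version in
the Python 3 docs):

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)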
> def chunked_stream(stream, size):
>     while True:
>         chunk = stream.read(size)
>         if not chunk:
>             return
>         yield chunk
This is just iter(partial(stream.read, size), '') (with a b'' sentinel if the stream is binary).
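For illustration, here's that two-argument iter() form reading a binary
file in 64KB chunks (the filename and process() are just placeholders):

    from functools import partial

    with open('data.bin', 'rb') as stream:  # placeholder filename
        # iter(callable, sentinel) calls stream.read(65536) repeatedly
        # until it returns the sentinel b'' (end of file)
        for chunk in iter(partial(stream.read, 65536), b''):
            process(chunk)  # placeholder for whatever you do per chunk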
> Maybe these functions should be in the stdlib? Too trivial?
I personally agree that grouper, and some of the other itertools recipes, should actually be included in the module, so you could just import itertools and call grouper instead of having to copy the 3 lines of code into dozens of different programs. But in practice I deal with that by just installing more-itertools off PyPI.
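(For what it's worth, more-itertools spells this chunked(); assuming I
remember its API right, it collects an iterable into lists of up to n
items:)

    from more_itertools import chunked  # pip install more-itertools

    list(chunked(range(10), 4))
    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]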
As for the original suggestion:
> On 04/06/2013 02:50 PM, Giampaolo Rodolà wrote:
>> def chunks(total, step):
>>     assert total >= step
>>     while total > step:
>>         yield step
>>         total -= step
>>     if total:
>>         yield total
I honestly don't think this is very useful. For one thing, if you really need it, it's equivalent to a trivial genexp:
    min(step, total - chunkstart) for chunkstart in range(0, total, step)
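As a quick sanity check that it produces the same sizes (using the
numbers from the example quoted below):

    total, step = (10 * 1024 * 1024) + 423, 262144
    sizes = list(min(step, total - chunkstart)
                 for chunkstart in range(0, total, step))
    assert sum(sizes) == total
    assert sizes[-1] == 423 and all(s == step for s in sizes[:-1])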
For another, I think most obvious uses for it would be better done at a higher level. For example:
>> FILESIZE = (10 * 1024 * 1024) + 423  # 10MB and 423 bytes
>> with open(TESTFN, 'wb') as f:
>>     for csize in chunks(FILESIZE, 262144):
>>         f.write(b'x' * csize)
First, is the memory cost of f.write(b'x' * FILESIZE) really an issue for your program? If so, aren't you better off creating an mmap and filling it with x? And if you want to do it with itertools, can't you just chunk repeat(b'x') instead of explicitly generating the lengths and multiplying them?
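To spell out that last alternative, here's one rough way to chunk
repeat(b'x') with islice (just a sketch, reusing TESTFN and FILESIZE
from the quoted example):

    from itertools import islice, repeat

    CHUNKSIZE = 262144
    with open(TESTFN, 'wb') as f:
        stream = repeat(b'x', FILESIZE)  # FILESIZE one-byte strings
        while True:
            # join up to CHUNKSIZE single bytes into one write
            chunk = b''.join(islice(stream, CHUNKSIZE))
            if not chunk:
                break
            f.write(chunk)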
Besides, the logic here is actually a bit hidden. You create a FILESIZE which is 10MB and 423 bytes, and then you use a function that writes the 10MB in groups of 256KB and then writes the 423 bytes. Why not just keep it simple?
    with open(TESTFN, 'wb') as f:
        for _ in range(10 * 1024 // 256):  # forty 256KB chunks = 10MB
            f.write(b'x' * (256 * 1024))
        f.write(b'x' * 423)  # plus the 423 leftover bytes