[Python-Dev] [Python-ideas] itertools.chunks(iterable, size, fill=None)

anatoly techtonik techtonik at gmail.com
Thu Jul 5 16:33:19 CEST 2012


On Wed, Jul 4, 2012 at 9:31 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 7/4/2012 5:57 AM, anatoly techtonik wrote:
>>
>> On Fri, Jun 29, 2012 at 11:32 PM, Georg Brandl <g.brandl at gmx.net> wrote:
>
>
>>> Anatoly, so far there were no negative votes -- would you care to go
>>> another step and propose a patch?
>>
>>
>> Was about to say "no problem",
>
>
> Did you read that there *are* strong negative votes? And that this idea has
> been rejected before? I summarized the objections in my two responses and
> pointed to the tracker issues. One of the objections is that there are 4
> different things one might want if the sequence length is not an even
> multiple of the chunk size. Your original 'idea' did not specify.

I actually meant that proposing a patch is a problem for me: it means
getting a checkout, working on a diff, and attaching it to the bug
tracker, as the developer guide says.

>> For now the best thing I can do (I won't even risk mentioning anything
>> about 3.3) is to copy/paste code from the docs here:
>>
>> from itertools import izip_longest
>> def chunks(iterable, size, fill=None):
>>      """Split an iterable into blocks of fixed-length"""
>>      # chunks('ABCDEFG', 3, 'x') --> ABC DEF Gxx
>>      args = [iter(iterable)] * size
>>      return izip_longest(fillvalue=fill, *args)
>
>
> Python ideas is about Python 3 ideas. Please post Python 3 code.
>
> This is actually a one liner
>
>     return zip_longest(*[iter(iterable)]*size, fillvalue=fill)
>
> We don't generally add such to the stdlib.

Can you figure out from the code what this stuff does?
It doesn't give chunks of strings.
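
To spell out what the one-liner does, here is a minimal Python 3 sketch.
The name grouper comes from the itertools recipes; the join step is only
my illustration. The list holds the same iterator size times, so
zip_longest pulls consecutive items into each tuple:

```python
from itertools import zip_longest

def grouper(iterable, size, fill=None):
    # One iterator repeated `size` times: each output tuple consumes
    # `size` consecutive items from the same underlying iterator.
    return zip_longest(*[iter(iterable)] * size, fillvalue=fill)

groups = list(grouper('ABCDEFG', 3, 'x'))
# groups is a list of tuples, not strings:
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]

# Getting strings back needs an extra join per group.
joined = [''.join(g) for g in groups]
# joined == ['ABC', 'DEF', 'Gxx']
```

The join only works because the items are strings; for other item types
the tuples would have to be converted some other way.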

>> BTW, this doesn't work as expected (at least for strings). Expected is:
>>    chunks('ABCDEFG', 3, 'x') --> 'ABC' 'DEF' 'Gxx'
>> got:
>>    chunks('ABCDEFG', 3, 'x') --> ('A' 'B' 'C') ('D' 'E' 'F') ('G' 'x' 'x')
>
>
> One of the problems with the idea of 'add a chunker' is that there are at
> least a dozen variants that different people want.

That's not the problem. People always want something extra. The
problem is that we don't have a real wish distribution. If 1000 people
want chunks and 1 wants groups with an exception, we still count these
as equal variants.

Therefore my idea is deliberately limited to the "string to chunks" user
story and to the implementation proposed on SO (Stack Overflow).

> I discussed the return type issue in my responses. I showed how to get
> the 'expected' response above using grouper, but also suggested that it
> is the wrong basis for splitting strings. Repeated slicing makes more
> sense for concrete sequence types.
>
> def seqchunk_odd(s, size):
>     # include odd size left over
>     for i in range(0, len(s), size):
>         yield s[i:i+size]
>
> print(list(seqchunk_odd('ABCDEFG', 3)))
> #
> ['ABC', 'DEF', 'G']

Right. That's the top answer on SO that people think should be in the
stdlib. Great, we are actually talking about the same thing.

> def seqchunk_even(s, size):
>     # only include even chunks
>     for i in range(0, size*(len(s)//size), size):
>         yield s[i:i+size]
>
> print(list(seqchunk_even('ABCDEFG', 3)))
> #
> ['ABC', 'DEF']

This is deducible from seqchunk_odd(s, size)

> def strchunk_fill(s, size, fill):
>     # fill odd chunks
>     q, r = divmod(len(s), size)
>     even = size * q
>     for i in range(0, even, size):
>         yield s[i:i+size]
>     if r:  # pad only when there is a leftover partial chunk
>         yield s[even:] + fill * (size - r)
>
> print(list(strchunk_fill('ABCDEFG', 3, 'x')))
> #
> ['ABC', 'DEF', 'Gxx']

Also deducible from seqchunk_odd(s, size)
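
A minimal sketch of what I mean by deducible, assuming the fill value is
a length-1 sequence of the same type as s. The helper names chunks_even
and chunks_fill are made up for illustration:

```python
def seqchunk_odd(s, size):
    # Slice-based chunker quoted above; the last chunk may be short.
    for i in range(0, len(s), size):
        yield s[i:i+size]

def chunks_even(s, size):
    # Hypothetical helper: drop a short trailing chunk.
    return (c for c in seqchunk_odd(s, size) if len(c) == size)

def chunks_fill(s, size, fill):
    # Hypothetical helper: pad a short trailing chunk.
    # Assumes `fill` has length 1 and the same type as `s`
    # ('x' for strings, [0] for lists). Full chunks get fill * 0,
    # which adds nothing.
    return (c + fill * (size - len(c)) for c in seqchunk_odd(s, size))

print(list(chunks_even('ABCDEFG', 3)))       # ['ABC', 'DEF']
print(list(chunks_fill('ABCDEFG', 3, 'x')))  # ['ABC', 'DEF', 'Gxx']
print(list(chunks_fill([1, 2, 3, 4], 3, [0])))  # [[1, 2, 3], [4, 0, 0]]
```

Both variants are thin wrappers, which is why I count them as one
feature rather than three.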

> Because the 'fill' value is necessarily a sequence for strings,
> strchunk_fill would only work for lists and tuples if the fill value were
> either required to be given as a tuple or list of length 1 or if it were
> internally converted inside the function. Skipping that for now.
>
> Having written the fill version based on the even version, it is easy to
> select among the three behaviors by modifying the fill version.
>
> def strchunk(s, size, fill=NotImplemented):
>     # fill odd chunks
>     q, r = divmod(len(s), size)
>     even = size * q
>     for i in range(0, even, size):
>         yield s[i:i+size]
>     if r and fill is not NotImplemented:
>         yield s[even:] + fill * (size - r)
>
> print(*strchunk('ABCDEFG', 3))
> print(*strchunk('ABCDEFG', 3, ''))
> print(*strchunk('ABCDEFG', 3, 'x'))
> #
> ABC DEF
> ABC DEF G
> ABC DEF Gxx

I now don't even think that a fill value is needed as an argument. The
caller can pad a short last chunk (a list here) in two lines:
if len(chunk) < size:
    chunk.extend([fill] * (size - len(chunk)))

> I already described how something similar could be done by checking each
> grouper output tuple for a fill value, but that requires that the fill value
> be a sentinel that could not otherwise appear in the tuple. One could modify
> grouper to fill with a private object() and check the last item of each
> group for that sentinel and act accordingly (delete, truncate, or replace).
> A generic API needs some thought, though.

I just need to chunk strings and sequences. A generic API is too complex
to design without counting all the use cases and iterating over them.

> An issue I did not previously mention is that people sometimes want
> overlapping chunks rather than contiguous disjoint chunks. The slice
> approach trivially adapts to that.
>
> def seqlap(s, size):
>     for i in range(len(s)-size+1):
>         yield s[i:i+size]
>
> print(*seqlap('ABCDEFG', 3))
> #
> ABC BCD CDE DEF EFG
>
> A sliding window for a generic iterable requires a deque or ring buffer
> approach that is quite different from the zip-longest -- grouper approach.
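
For reference, the deque approach mentioned above could look roughly
like this. It is a sketch only: the name window and the tuple return
type are assumptions, not something proposed in this thread, and size is
assumed to be at least 1:

```python
from collections import deque
from itertools import islice

def window(iterable, size):
    # Sliding window over any iterable, not just sequences.
    it = iter(iterable)
    # Prime the buffer with the first `size` items; maxlen makes the
    # deque drop its oldest item on every later append.
    buf = deque(islice(it, size), maxlen=size)
    if len(buf) == size:
        yield tuple(buf)
    for item in it:
        buf.append(item)
        yield tuple(buf)

print(*(''.join(w) for w in window(iter('ABCDEFG'), 3)))
# ABC BCD CDE DEF EFG
```

Unlike the slice version, this needs neither len() nor indexing, but it
yields tuples and keeps size items buffered at all times.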

That's why I'd like to drastically reduce the scope of the proposal.
itertools doesn't seem to be the best place anymore. How about a
sequence method?

   string.chunks(size) -> ABC DEF G
   list.chunks(size) -> [A,B,C], [D,E,F], [G]

If somebody needs a keyword argument - this can come later without
breaking compatibility.
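
As a plain function, the proposed behavior could be sketched as follows.
Neither str.chunks nor list.chunks exists; the function name chunks and
the ValueError check are assumptions for illustration:

```python
def chunks(seq, size):
    # Hypothetical stand-in for the proposed seq.chunks(size) method.
    # Works for any sequence that supports len() and slicing; each chunk
    # has the same type as `seq`, and the last one may be shorter.
    if size < 1:
        raise ValueError("size must be at least 1")
    return [seq[i:i+size] for i in range(0, len(seq), size)]

print(chunks('ABCDEFG', 3))
# ['ABC', 'DEF', 'G']
print(chunks(list('ABCDEFG'), 3))
# [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
```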

