[Python-ideas] itertools.chunks(iterable, size, fill=None)

anatoly techtonik techtonik at gmail.com
Thu Jul 5 15:36:24 CEST 2012


Before anything else I must apologize for the significant lag in my
replies. I cannot hold all of the messages in my head, so I reply to
them one by one as they come, trying not to miss a single point. It
would be much easier to do this in a unified interface for threaded
discussions, but for now neither Mailman nor GMail has such a
capability. And when the amount of text turns out to be too big, I
spend a lot of time trying to squeeze it down, and then it becomes
pointless to send at all.

Now back on the topic:

On Sun, Jul 1, 2012 at 12:09 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 6/29/2012 4:32 PM, Georg Brandl wrote:
>>
>> On 26.06.2012 10:03, anatoly techtonik wrote:
>>>
>>> Now that Python 3 is all about iterators (which is a user killer
>>> feature for Python according to StackOverflow -
>>> http://stackoverflow.com/questions/tagged/python) would it be nice to
>>> introduce more first class functions to work with them? One function
>>> to be exact to split string into chunks.
>
> Nothing special about strings.

It seemed so, but it turned out that the grouper recipe didn't work for me.

>>>      itertools.chunks(iterable, size, fill=None)
>
> This is a renaming of itertools.grouper in 9.1.2. Itertools Recipes. You
> should have mentioned this. I think of 'blocks' rather than 'chunks', but I
> notice several SO questions with 'chunk(s)' in the title.

I guess `block` gives too low a signal/noise ratio in search results.
That's probably why it is also called chunks in other languages, where
`block` stands for something else (I am speaking of Ruby blocks).

>>> Which is the 33rd most voted Python question on SO -
>>>
>>> http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464
>
> I am curious how you get that number. I do note that there are about 15
> other Python SO questions that seem to be variations on the theme. There
> might be more if 'blocks' and 'groups' were searched for.

It's easy:
1. Go to http://stackoverflow.com/
2. Search [python]
3. Click `votes` tab
4. Choose `30 per page` at the bottom
5. Jump to the second page, there it is 4th from the top:
http://stackoverflow.com/questions/tagged/python?page=2&sort=votes&pagesize=30

As for duplicates - feel free to mark them as such. SO allows
everybody to do this (unlike Roundup).

>> Anatoly, so far there were no negative votes -- would you care to go
>> another step and propose a patch?
>
> That is because Raymond H. is not reading either list right now ;-)
> Hence the Cc:. Also because I did not yet respond to a vague, very
> incomplete idea.
>
> From Raymond's first message on http://bugs.python.org/issue6021 , add
> grouper:
>
> "This has been rejected before.

I see such arguments quite often, and I am tired of repeating that
they are not arguments. It is good to know the history, but when
people use it as a reason to close tickets - that's just disgusting.
To Raymond's credit, he takes care to explain.

> * It is not a fundamental itertool primitive.  The recipes section in
> the docs shows a clean, fast implementation derived from zip_longest().

What is the definition of 'fundamental primitive'?
To me the fact that top answer for chunking strings on SO has 2+ times
more votes than itertools versions is a clear 5 sigma indicator that
something is wrong with this Standard model without chunks boson.

> * There is some debate on a correct API for odd lengths.  Some people
> want an exception, some want fill-in values, some want truncation, and
> some want a partially filled-in tuple.  That alone is reason enough not
> to set one behavior in stone.

use case 3.1: odd lengths exception (CHOOSE ONE)
1. I see that no itertools function throws exceptions, so check manually:
    len(iterable) % size == 0
2. Explicitly
  -  itertools.chunks(iterable, size, fill=None)
  +  itertools.chunks(iterable, size, fill=None, exception=False)

use case 3.2. fill in value. it is here (SOLVED)

use case 3.3: truncation
no itertools function supports truncation, do it manually
   chunks(iter, size)[:len(iter)//size]

use case 4: partially filled-in tuple
  What should be there?
   >>> chunks('ABCDEFG', 3, 'x')
   >>> |
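
For a sequence with a known length, the manual checks from use cases
3.1 and 3.3 could look roughly like the sketch below. Note the
assumptions: chunks() here is a hypothetical slicing helper written
for this example (it is not an existing itertools function), and
is_exact() is a name invented for the illustration.

```python
def chunks(seq, size):
    """Hypothetical helper: slice a sequence into pieces of length size."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def is_exact(seq, size):
    """Use case 3.1: True when seq divides into full chunks only."""
    return len(seq) % size == 0


data = 'ABCDEFG'

# Use case 3.1: the caller decides whether an odd length is an error.
exact = is_exact(data, 3)

# Use case 3.3: truncation, keep only the complete chunks.
full_only = chunks(data, 3)[:len(data) // 3]

print(exact)      # False
print(full_only)  # ['ABC', 'DEF']
```

Both checks rely on len(), so they only apply to sequences, not to
arbitrary iterators.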


More replies and workarounds to some of the raised points are below.

> * There is an issue with having too many itertools.  The module taken as
> a whole becomes more difficult to use as new tools are added."

There can be only two reasons for that:
* the chosen basis is bad (many functions that are rarely used or
  easily emulated)
* the basis is good, but insufficient, because the universe of
  iterators is more complicated than we think

> This is not to say that the question should not be re-considered. Given the
> StackOverflow experience in addition to that of the tracker and python-list
> (and maybe python-ideas), a special exception might be made in relation to
> points 1 and 3.

--[offtopic about Python enhancements / proposals feedback]--
Yes, without SO I probably wouldn't have raised this at all, because
the tracker doesn't help with raising importance - there are no votes,
no feature proposals, no "stars". And what I "like" the most is that
very "nice" resolution status - "committed/rejected" - which doesn't
say anything at all. Python list? I try not to disrupt the frequency
there. Python ideas? Too low a participation level for gathering
signals. There are many people who read and support, but don't want to
reply (they don't want to stand out or are just lazy). There are many
outside who don't want to be subscribed at all. There are 2000+ people
spending time at Python conferences all over the world each year, yet
we see only a couple of reactions for every Python idea here. Quite
often there are mistakes and omissions that would be nice to correct
and you can't. So StackOverflow really helps here, but it is a Q&A
tool, which is still much better than MLs that are solely for
chatting, brainstorming and all the crazy reading / writing stuff.
They don't help to develop ideas collaboratively. Quite often I am
just lost in the amount of text to handle.
--[/offtopic]--

> In regard to point 2: many 'proposals', including Anatoly's, neglect this
> detail. But the function has to do *something* when seqlen % grouplen != 0.
> So an 'idea' is not really a concrete programmable proposal until
> 'something' is specified.
>
> Exception -- not possible for an itertool until the end of the iteration
> (see below). To raise immediately for sequences, one could wrap grouper.
>
> def exactgrouper(sequence, k):  # untested
>   if len(sequence) % k:
>     raise ValueError(
>       'Sequence length {} must be a multiple of group length {}'
>       .format(len(sequence), k))
>   else:
>     # grouper is the recipe from the itertools docs, not a module function
>     return grouper(k, sequence)

Right. An iterator is not a sequence, because it doesn't know the
length of the data it iterates over. The method should not belong to
itertools at all then.

Python 3 has definitely become more complicated. I'd prefer to stay
away from the iterator stuff, but that seems to get harder with every
iteration.

> Of course, sequences can also be directly sequentially sliced (but should
> the result be an iterable or sequence of blocks?). But we do not have a
> seqtools module and I do not think there should be another method added to
> the seq protocol.

I'd expect strings chunked into strings and lists into lists. Don't
want to know anything about protocols.

> Fill -- grouper always does this, with a default of None.
>
> Truncate, Remainder -- grouper (zip_longest) cannot directly do this and no
> recipes are given in the itertools docs. (More could be, see below.)
>
> Discussions on python-list gives various implementations either for
> sequences or iterables. For the latter, one approach is "it =
> iter(iterable)" followed by repeated islice of the first n items. Another is
> to use a sentinel for the 'fill' to detect a final incomplete block (tuple
> for grouper).
>
> def grouper_x(n, iterable):  # untested
>   sentinel = object()
>   for g in grouper(n, iterable, sentinel):
>     if g[-1] is not sentinel:
>       yield g
>     else:
>       pass  # to truncate (drop the incomplete final group)
>       # yield g[:g.index(sentinel)]  # for remainder
>       # raise ValueError  # for delayed exception

We need a simple function to split a sequence into chunks(). Now we
are faced with the problem of applying that technique to a sequence of
infinite length, at the moment when the last element of the infinite
sequence is encountered. You might be thinking now that this is a
reduction to absurdity. But I'd say it is an exit from the trap.
Mathematically this problem can't be solved. I am not ignoring your
solution - I think it's quite feasible, but isn't it an
overcomplication?

I mean the 160 people who upvoted the top answer (the question itself
has 149 upvotes) are pretty happy with an answer that just outputs the
last chunk as-is:
http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python

   chunks('ABCDEFG', 3) --> 'ABC' 'DEF' 'G'

And it is quite a nice solution to me, because you're free to do
anything you'd like if you expect your data to be odd:

   for chunk in chunks('ABCDEFG', size):
     if len(chunk) < size:
       raise Tail

You can make a helper iterator out of it too.
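
A minimal sketch of what I mean, assuming the input is a sequence that
supports len() and slicing. The names chunks(), strict_chunks() and
Tail are made up for this illustration; none of them exist in the
standard library.

```python
def chunks(seq, size):
    """Yield successive slices of seq; the last one may be shorter."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


class Tail(Exception):
    """Hypothetical exception signalling an incomplete last chunk."""


def strict_chunks(seq, size):
    """Helper iterator: like chunks(), but raise Tail on a short chunk."""
    for chunk in chunks(seq, size):
        if len(chunk) < size:
            raise Tail(chunk)
        yield chunk


print(list(chunks('ABCDEFG', 3)))        # ['ABC', 'DEF', 'G']
print(list(chunks([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

Because slicing preserves the type, strings come out as strings and
lists as lists, and the exception is only raised when the short chunk
is actually reached.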

> ---
> The above discussion of point 2 touches on point 4, which Raymond neglected
> in the particular message above but which has come up before: What are the
> allowed input and output types? An idea is not a programmable proposal until
> the domain, range, and mapping are specified.

Domain? Mapping? I am not ignoring existing knowledge and experience.
I just don't want to complicate things and don't see an appropriate
`import usecase` in the current context, so I won't try to guess what
this means.

in string -> out list of strings
in list -> out list of lists

> Possible inputs are a specific sequence (string, for instance), any
> sequence, any iterable. Possible outputs are a sequence or iterator of
> sequence or iterator. The various python-list and stackoverflow posts and
> questions ask for various combinations. zip_longest and hence grouper takes
> any iterable and returns an iterator of tuples. (An iterator of maps might
> be more useful as a building block.) This is not what one usually wants with
> string input, for instance, nor with range input. To illustrate:

All right. Got it. Sequences have a length and can be sliced with
[i:j]; an iterator can't be sliced (and hence no chunks can be made).
So this function doesn't belong to itertools - it is a missing string
or sequence method. We can't get a chunk from an iterator, because an
iterator over a string decomposes it into a group of pieces with no
reverse function. We can have a group and then join the group into
something. But this requires knowledge of the appropriate join()
function for the iterator, and is probably not efficient. As there is
no such function (it must be that Mapping you referenced above) - the
recomposition into chunks is impossible.
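
To illustrate the point about join(): a rough sketch of chunking an
arbitrary iterable with repeated islice calls, where the caller has to
supply the function that recomposes each group. The name iter_chunks()
and its join parameter are assumptions made for this example only, not
an existing or proposed itertools API.

```python
from itertools import islice


def iter_chunks(iterable, size, join=tuple):
    """Yield groups of up to size items, recomposed with join.

    The iterator itself cannot know how to rebuild a chunk, so the
    caller passes join (for example ''.join for characters).
    """
    it = iter(iterable)
    while True:
        group = list(islice(it, size))
        if not group:
            return
        yield join(group)


# Without a join function we only get tuples of pieces.
print(list(iter_chunks(iter('ABCDEFG'), 3)))
# With an explicit join the string chunks come back.
print(list(iter_chunks(iter('ABCDEFG'), 3, ''.join)))
```

The extra join step is exactly the knowledge a plain iterator lacks,
and it costs an intermediate list per group.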

> import itertools as it
>
> def grouper(n, iterable, fillvalue=None):
>     "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
>     args = [iter(iterable)] * n
>     return it.zip_longest(*args, fillvalue=fillvalue)
>
> print(*(grouper(3, 'ABCDEFG', 'x')))  # probably not wanted
> print(*(''.join(g) for g in grouper(3, 'ABCDEFG', 'x')))
> #
> ('A', 'B', 'C') ('D', 'E', 'F') ('G', 'x', 'x')
> ABC DEF Gxx
>
> --
> What to do? One could easily write 20 different functions. So more thought
> is needed before adding anything. -1 on the idea as is.

I've learned the English name for a type of argument that is new to
me - "straw man" (I used to call this "hijacking"). This -1 doesn't
belong to the original idea. It belongs to a proposal of
itertools.chunks() with the long list of points above and completely
different user stories (i.e. not "split string into chunks"). I hope
you are still +1 with the 160 people on SO who think Python needs an
easy way to chunk sequences.

> For the doc, I think it would be helpful here and in most module subchapters
> if there were a subchapter table of contents at the top (under 9.1 in this
> case). Even though just 2 lines here (currently, but see below), it would
> let people know that there *is* a recipes section. After the appropriate
> tables, mention that there are example uses in the recipe section. Possibly
> add similar tables in the recipe section.

Unfortunately, it turned out that grouper() is not chunks(). Given a
string as input, it delivers a list of lists of chars instead of a
list of chunks.

> Another addition could be a new subsection on grouping (chunking) that would
> discuss post-processing of grouper (as discussed above), as well as other
> recipes, including ones specific to strings and sequences. It would
> essentially be a short how-to. Call it 9.1.3 "Grouping, Blocking, or
> Chunking Sequences and Iterables". The synonyms will help external
> searching. A toc would let people who have found this doc know to look for
> this at the bottom.

This makes matters pretty ugly. In an ideal language there should be
fewer docs, not more.


