Adding a new function "zip_flat" to itertools (Re: Rewriting the "roundrobin" recipe in the itertools documentation)
Dear all: thank you for your replies and thoughts, most especially Steve
and Terry. I am more-or-less new to contributing to Python, so I wasn't
sure that the bug tracker was the best way to start -- I was looking for a
sanity check and received exactly what I wanted :) Thanks to the feedback
here, the bug tracker issue will be much cleaner from the start.
Regarding the meta discussion about my OP: yes it was long winded,
detailed, pedantic and (to a certain extent) bombastic, but I was imaging
the many possible responses to the suggestion (and suggested replacement)
the I felt I should explain where I was coming from. Even though there was
a lot of disagreement about how I described the current recipe as not very
Pythonic, I'm very glad for all the perspectives and lessons I got from
reading the ensuing discussions.
Now, having had some time to think, I've come up with further thoughts on
the topic. In particular, I was going to create a new bug, and wrote
several paragraphs on the topic summarizing this thread, and my thoughts.
I'll paste those paragraphs here:
"""
To summarize, I found the current implementation of the "roundrobin"
function difficult to understand, and proposed a much simpler solution
that, while producing correct results, isn't quite "correct" in the sense
that it glosses over a detail before "manually" correcting the detail at
the end, and as such is prone to severe inefficiency in extreme cases.
There were a smattering of comments from several people indicating that
they liked the simpler recipe better, despite its performance drawbacks.
Terry Reedy proposed a slightly rewritten version of the current recipe,
which implements the correct algorithm (without glossing over and manually
correcting the details). Although I have since changed my perceptions of
the original, now understanding how it works, the rewritten version from
Terry was clearer enough that I was able to understand *it* where I could
not previously understand the original. (My newfound understanding of the
original is largely derived from making sense of the rewritten version,
which properly clued me in to what cycle and islice actually do.
Either way, the current recipe can certainly be improved. Although I now
find the original and its rewrite to be algorithmically clean and correct
(even if the code and inline comments can be improved), the StackOverflow
question (
https://stackoverflow.com/questions/3678869/pythonic-way-to-combine-two-list...)
that originally got me thinking about the problem demonstrates that the
algorithmically clean way is *not* obvious at all to people who aren't very
familiar with the itertools module -- which is the large majority of people
who use Python (even if that's a very small fraction of the people reading
this bug). The second from top answer is the answer which references the
recipe in the docs, but as my own first post to python-ideas demonstrates,
the (large?) majority of python users aren't familiar enough with the
itertools module to be able to understand that recipe, and I also believe
many people were looking for one or two liners to use in their own code
without a function call. Further confusion on the overall topic is the lack
of a clear name -- "roundrobin", "alternate", "interleave", "merge", and
variations and others.
"""
Having completed those, I found a roughly duplicate StackOverflow question
to the one from my OP:
https://stackoverflow.com/questions/243865/how-do-i-merge-two-python-iterato...
Besides emphasizing my points about not having even a clear name for the
topic, a desire for one liners, mass confusion around the issue (especially
regarding flattening zip [which terminates on the shortest input, a hidden
gotcha], zip_longest [someone else found the same solution as me and others
in this op), and all around failure to generate anything even resembling a
consensus on the topic, I also found this answer:
https://stackoverflow.com/questions/243865/how-do-i-merge-two-python-iterato...
which proposes a solution that is both more correct and efficient than the
zip_longest-with-sentinels, and also noticeably more readable than either
the original doc recipe or even Terry's cleaned up replacement of it.
Given this variety of problems with the issue, I now think that -- while
updating the itertools recipe is certainly better than nothing -- the
better thing to do might be to just add a new function to itertools called
"zip_flat" which solves this problem. In addition to answering the stack
overflow questions with ongoing debate about efficiency, correctness, and
pythonicity (pythonicness?), it would also help to greatly clarify the
naming issue as well. (Sidenote: whoever came up with "zip" as the name for
the builtin was quite creative. It's a remarkably short and descriptive.)
What are the sentiments of readers here? If positive, I'll create an issue
on BPO about zip_flat (rather than just improving the docs recipe).
(Sorry Steve for bringing this back to -ideas! At least this time I'm
proposing an addition to the language itself! :)
Thanks for your consideration,
Bill
On Thu, Nov 16, 2017 at 5:06 PM, Steven D'Aprano
On Thu, Nov 16, 2017 at 02:56:29PM -0500, Terry Reedy wrote:
3) using a while loop in code dealing with iterables,
I agree that this is not necessary, and give a replacement below.
The OP isn't just saying that its unnecessary in this case, but that its unPythonic to ever use a while loop in code dealing with iterables. I disagree with that stronger statement.
def roundrobin(*iters): "roundrobin('ABC', 'D', 'EF') --> A D E B F C" # Perhaps "flat_zip_nofill" is a better name, or something similar sentinel = object() for tup in it.zip_longest(*iters, fillvalue=sentinel): yield from (x for x in tup if x is not sentinel)
This adds and then deletes grouping and fill values that are not wanted. To me, this is an 'algorithm smell'. One of the principles of algorithm design is to avoid unnecessary calculations. For an edge case such as roundrobin(1000000*'a', ''), the above mostly does unnecessary work.
Its a recipe, not a tuned and optimized piece of production code.
And if you're going to criticise code on the basis of efficiency, then I would hope you've actually profiled the code first. Because it isn't clear to me at all that what you call "unnecessary work" is more expensive than re-writing the recipe using a more complex algorithm with calls to cycle and islice.
But I'm not here to nit-pick your recipe over the earlier ones.
[...]
Since I have a competing 'improvement', I would hold off on a PR until Raymond Hettinger, the itertools author, comments.
Raise a doc issue on the tracker, and take the discussion there. I think that this is too minor an issue to need long argument on the list. Besides, it's not really on-topic as such -- it isn't about a change to the language. Its about an implementation change to a recipe in the docs.
-- Steve _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Hi Bill, I don't have time to go through your email in detail and respond to every point you raise, but I'd like to respond to one point you made. On Mon, Nov 20, 2017 at 02:03:13AM -0600, bunslow wrote:
I also found this answer:
https://stackoverflow.com/questions/243865/how-do-i-merge-two-python-iterato...
which proposes a solution that is both more correct and efficient than the zip_longest-with-sentinels, and also noticeably more readable than either the original doc recipe or even Terry's cleaned up replacement of it.
Please don't make claims about correctness and efficiency without testing the code first. The second suggestion given there, using deque, is *not* correct as provided, as it fails to work with iterables. It requires the caller to pass only iterators, unlike the existing roundrobin recipe which accepts any iterable. Nor is it more efficient, at least on my machine -- in fact the opposite, it is the worst performing of the four recipes I've tried: - the current recipe from the itertools docs; - your re-write, using zip_longest; - Terry's version; - and the one from stackoverflow. I've attached my test code, in case you want to play around with it. Apologies in advance for any bugs in the test code (its 2 in the morning here and I've had a long day). According to my testing, on my computer using Python 3.5, Terry's code is by far the fastest in all three separate test cases, but that probably shouldn't count since it's buggy (it truncates the results and bails out early under some circumstances). Out of the implementations that don't truncate, the existing recipe is by far the fastest. Terry, if you're reading this, try: list(roundrobin('A', 'B', 'CDE')) Your version truncates the results to A B C instead of A B C D E as the itertools recipe gives. But I digress... the point is, if you're going to make claims about the correctness and efficiency of code, you really ought to actually check the correctness and efficiency first. I may or may not get around to responding to some of your other points over the next few days. -- Steve
Oops, a slight buglet in my test code: [Attachment: rr_tester.py] Change the constant NUMBER = 30 to the largest value you can bear for more reliable timing tests. I lowered the value for a quick test and forgot to increase it again. -- Steve
On 11/20/2017 11:08 AM, Steven D'Aprano wrote:
Please don't make claims about correctness and efficiency without testing the code first. The second suggestion given there, using deque, is *not* correct as provided, as it fails to work with iterables. It requires the caller to pass only iterators, unlike the existing roundrobin recipe which accepts any iterable.
Nor is it more efficient, at least on my machine -- in fact the opposite, it is the worst performing of the four recipes I've tried:
- the current recipe from the itertools docs; - your re-write, using zip_longest; - Terry's version; - and the one from stackoverflow.
I've attached my test code, in case you want to play around with it. Apologies in advance for any bugs in the test code (its 2 in the morning here and I've had a long day).
According to my testing, on my computer using Python 3.5, Terry's code is by far the fastest in all three separate test cases, but that probably shouldn't count since it's buggy (it truncates the results and bails out early under some circumstances). Out of the implementations that don't truncate, the existing recipe is by far the fastest.
Terry, if you're reading this, try:
list(roundrobin('A', 'B', 'CDE'))
Your version truncates the results to A B C instead of A B C D E as the itertools recipe gives.
This is due to an off-by-1 error which I corrected 3 hours later in a follow-up post, repeated below. --- Correct off-by-one error. I should have tested with an edge case such as print(list(roundrobin('ABC', '')))
The following combines 3 statements into one for statement.
def roundrobin(*iterables): "roundrobin('ABC', 'D', 'EF') --> A D E B F C" nexts = cycle(iter(it).__next__ for it in iterables) for reduced_len in reversed(range(1, len(iterables))):
Make that 0 rather than 1 for start value.
try: for next in nexts: yield next() except StopIteration: nexts = cycle(islice(nexts, reduced_len))
A slightly clearer, slightly less efficient alternative would be def roundrobin(*iterables): "roundrobin('ABC', 'D', 'EF') --> A D E B F C" nexts = cycle(iter(it).__next__ for it in iterables) for current_len in reversed(range(1, len(iterables)+1)): try: for next in nexts: yield next() except StopIteration: nexts = cycle(islice(nexts, current_len - 1)) -- Terry Jan Reedy
After correcting Terry's off-by-one error, I get this on Python 3.6 (and
bumping NUMBER to 1000). In speed, Terry's is either very slightly faster
or very slightly slower than the recipe, depending on the data. I think
Terry's is more readable that roundrobin(), but still requires a little
thought compared to Bunslow's. However, using the builtin name 'next' as a
variable name seems like a bad choice to me (it doesn't break in this code,
but seems like bad practice, i'd choose some other variable name).
507-code % python rr_tester.py
testing with equal_data
roundrobin 0.187816
bunslow 0.393162
terry 0.183851
stackoverflow 0.675543
**************************************************
testing with unequal_data
roundrobin 0.209959
bunslow 0.555731
terry 0.233015
stackoverflow 0.746880
**************************************************
testing with extreme_data
roundrobin 0.053149
bunslow 0.273607
terry 0.051963
stackoverflow 0.158515
508-code % python --version
Python 3.6.2 :: Anaconda custom (x86_64)
On Mon, Nov 20, 2017 at 9:15 AM, Terry Reedy
On 11/20/2017 11:08 AM, Steven D'Aprano wrote:
Please don't make claims about correctness and efficiency without
testing the code first. The second suggestion given there, using deque, is *not* correct as provided, as it fails to work with iterables. It requires the caller to pass only iterators, unlike the existing roundrobin recipe which accepts any iterable.
Nor is it more efficient, at least on my machine -- in fact the opposite, it is the worst performing of the four recipes I've tried:
- the current recipe from the itertools docs; - your re-write, using zip_longest; - Terry's version; - and the one from stackoverflow.
I've attached my test code, in case you want to play around with it. Apologies in advance for any bugs in the test code (its 2 in the morning here and I've had a long day).
According to my testing, on my computer using Python 3.5, Terry's code is by far the fastest in all three separate test cases, but that probably shouldn't count since it's buggy (it truncates the results and bails out early under some circumstances). Out of the implementations that don't truncate, the existing recipe is by far the fastest.
Terry, if you're reading this, try:
list(roundrobin('A', 'B', 'CDE'))
Your version truncates the results to A B C instead of A B C D E as the
itertools recipe gives.
This is due to an off-by-1 error which I corrected 3 hours later in a follow-up post, repeated below. ---
Correct off-by-one error. I should have tested with an edge case such as print(list(roundrobin('ABC', '')))
The following combines 3 statements into one for statement.
def roundrobin(*iterables): "roundrobin('ABC', 'D', 'EF') --> A D E B F C" nexts = cycle(iter(it).__next__ for it in iterables) for reduced_len in reversed(range(1, len(iterables))):
Make that 0 rather than 1 for start value.
try: for next in nexts: yield next() except StopIteration: nexts = cycle(islice(nexts, reduced_len))
A slightly clearer, slightly less efficient alternative would be
def roundrobin(*iterables): "roundrobin('ABC', 'D', 'EF') --> A D E B F C" nexts = cycle(iter(it).__next__ for it in iterables) for current_len in reversed(range(1, len(iterables)+1)): try: for next in nexts: yield next() except StopIteration: nexts = cycle(islice(nexts, current_len - 1))
-- Terry Jan Reedy
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Mon, Nov 20, 2017 at 12:15:50PM -0500, Terry Reedy wrote:
Your version truncates the results to A B C instead of A B C D E as the itertools recipe gives.
This is due to an off-by-1 error which I corrected 3 hours later in a follow-up post, repeated below.
Ah, I missed that, thanks.
def roundrobin(*iterables): "roundrobin('ABC', 'D', 'EF') --> A D E B F C" nexts = cycle(iter(it).__next__ for it in iterables) for reduced_len in reversed(range(1, len(iterables))):
Make that 0 rather than 1 for start value.
Is there a reason for calling reversed() instead of reversing the order of range in the first place? range(len(iterables)-1, -1, -1) -- Steve
On 11/20/2017 4:57 PM, Steven D'Aprano wrote:
On Mon, Nov 20, 2017 at 12:15:50PM -0500, Terry Reedy wrote:
Your version truncates the results to A B C instead of A B C D E as the itertools recipe gives.
This is due to an off-by-1 error which I corrected 3 hours later in a follow-up post, repeated below.
Ah, I missed that, thanks.
def roundrobin(*iterables): "roundrobin('ABC', 'D', 'EF') --> A D E B F C" nexts = cycle(iter(it).__next__ for it in iterables) for reduced_len in reversed(range(1, len(iterables))):
Make that 0 rather than 1 for start value.
Is there a reason for calling reversed() instead of reversing the order of range in the first place?
range(len(iterables)-1, -1, -1)
Readability. Accurately and automatically reversing ranges was one of the use cases motivating the addition of 'reversed'. Ranges have a __reversed__ method which returns a iterator for a reversed range.
reversed(range(0, n))
list(reversed(range(0, n))) == list(iter(range(n-1, -1, -1))) True
-- Terry Jan Reedy
(Regarding the stackoverflow link, he does have a version that includes the
trivial fix to apply iter() to args, at the bottom of the post.)
I think that the variety of solutions with a variety of merits is
indicative that it would be useful to have a "zip_flat" in the itertools
module, but I guess support for that isn't very strong here. I see that
Terry opened a bug to clean up the current recipe in the docs, which is
better than nothing, but I think it's the absolute bare minimum improvement.
I'm prepared to drop the subject unless there's further discussion in favor
of it.
Thanks again to all for the replies.
Bill
On Mon, Nov 20, 2017 at 10:08 AM, Steven D'Aprano
Hi Bill,
I don't have time to go through your email in detail and respond to every point you raise, but I'd like to respond to one point you made.
On Mon, Nov 20, 2017 at 02:03:13AM -0600, bunslow wrote:
I also found this answer:
https://stackoverflow.com/questions/243865/how-do-i- merge-two-python-iterators/40498526#40498526
which proposes a solution that is both more correct and efficient than the zip_longest-with-sentinels, and also noticeably more readable than either the original doc recipe or even Terry's cleaned up replacement of it.
Please don't make claims about correctness and efficiency without testing the code first. The second suggestion given there, using deque, is *not* correct as provided, as it fails to work with iterables. It requires the caller to pass only iterators, unlike the existing roundrobin recipe which accepts any iterable.
Nor is it more efficient, at least on my machine -- in fact the opposite, it is the worst performing of the four recipes I've tried:
- the current recipe from the itertools docs; - your re-write, using zip_longest; - Terry's version; - and the one from stackoverflow.
I've attached my test code, in case you want to play around with it. Apologies in advance for any bugs in the test code (its 2 in the morning here and I've had a long day).
According to my testing, on my computer using Python 3.5, Terry's code is by far the fastest in all three separate test cases, but that probably shouldn't count since it's buggy (it truncates the results and bails out early under some circumstances). Out of the implementations that don't truncate, the existing recipe is by far the fastest.
Terry, if you're reading this, try:
list(roundrobin('A', 'B', 'CDE'))
Your version truncates the results to A B C instead of A B C D E as the itertools recipe gives.
But I digress... the point is, if you're going to make claims about the correctness and efficiency of code, you really ought to actually check the correctness and efficiency first.
I may or may not get around to responding to some of your other points over the next few days.
-- Steve
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
21.11.17 01:00, Terry Reedy пише:
On 11/20/2017 4:57 PM, Steven D'Aprano wrote:
Is there a reason for calling reversed() instead of reversing the order of range in the first place?
range(len(iterables)-1, -1, -1)
Readability. Accurately and automatically reversing ranges was one of the use cases motivating the addition of 'reversed'. Ranges have a __reversed__ method which returns a iterator for a reversed range.
Agree. You also can write this as range(1, len(iterables))[::-1]. This is shorter and (I suppose) slightly faster. But reverse() can be better choice for a recipe, if it looks less confusing for beginners.
participants (5)
-
bunslow
-
David Mertz
-
Serhiy Storchaka
-
Steven D'Aprano
-
Terry Reedy