[Python-ideas] Adding a new function "zip_flat" to itertools (Re: Rewriting the "roundrobin" recipe in the itertools documentation)

Mon Nov 20 11:08:02 EST 2017

Hi Bill, 

I don't have time to go through your email in detail and respond to 
every point you raise, but I'd like to respond to one point you made.

On Mon, Nov 20, 2017 at 02:03:13AM -0600, bunslow wrote:

> I also found this answer:
> 
> https://stackoverflow.com/questions/243865/how-do-i-merge-two-python-iterators/40498526#40498526
> 
> which proposes a solution that is both more correct and efficient than the
> zip_longest-with-sentinels, and also noticeably more readable than either
> the original doc recipe or even Terry's cleaned up replacement of it.

Please don't make claims about correctness and efficiency without 
testing the code first. The second suggestion given there, using deque, 
is *not* correct as provided, as it fails to work with iterables. It 
requires the caller to pass only iterators, unlike the existing 
roundrobin recipe which accepts any iterable.

Nor is it more efficient, at least on my machine -- in fact the 
opposite, it is the worst performing of the four recipes I've tried:

- the current recipe from the itertools docs;
- your re-write, using zip_longest;
- Terry's version;
- and the one from stackoverflow.

I've attached my test code, in case you want to play around with it. 
Apologies in advance for any bugs in the test code (its 2 in the 
morning here and I've had a long day).

According to my testing, on my computer using Python 3.5, Terry's code 
is by far the fastest in all three separate test cases, but that 
probably shouldn't count since it's buggy (it truncates the results and 
bails out early under some circumstances). Out of the implementations 
that don't truncate, the existing recipe is by far the fastest.

Terry, if you're reading this, try:

list(roundrobin('A', 'B', 'CDE'))

Your version truncates the results to A B C instead of A B C D E as the 
itertools recipe gives.

But I digress... the point is, if you're going to make claims about the 
correctness and efficiency of code, you really ought to actually check 
the correctness and efficiency first.

I may or may not get around to responding to some of your other points 
over the next few days.

-- 
Steve
-------------- next part --------------
import itertools

from itertools import *
from timeit import Timer

# From the itertools recipe
def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    pending = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

def bunslow(*iters):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    sentinel = object()
    for tup in itertools.zip_longest(*iters, fillvalue=sentinel):
        yield from (x for x in tup if x is not sentinel)

def terry(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    nexts = cycle(iter(it).__next__ for it in iterables)
    for reduced_len in reversed(range(1, len(iterables))):
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            nexts = cycle(islice(nexts, reduced_len))

from collections import deque

def stackoverflow(*iterators):
    queue = deque(iter(it) for it in iterators)
    while len(queue) > 0:
        iterator = queue.popleft()
        try:
            yield next(iterator)
            queue.append(iterator)
        except StopIteration:
            pass

FUNCTIONS = (roundrobin, bunslow, terry, stackoverflow)

equal_data = [range(500)]*4
unequal_data = [range(500), range(600), range(700), range(400)]
extreme_data = [range(1), range(1), range(1), range(500)]

def trial(data_name, number, repeat):
    print("testing with %s" % data_name)
    template = 'from __main__ import %s as rr, %s as data'
    data = globals()[data_name]
    assert len(data) == 4 and all(type(x) is range for x in data)
    expected = list(roundrobin(*data))
    failed = []
    for func in FUNCTIONS:
        actual = list(func(*data))
        if actual != expected:
            failed.append(func)
        setup = template % (func.__name__, data_name)
        T = Timer('list(rr(*data))', setup=setup)
        t = min(T.repeat(number=number, repeat=repeat))
        print('  %-15s %f' % (func.__name__, t))
    for f in failed:
        print('  *** %s returned incorrect results ***' % f.__name__)

NUMBER = 30
REPEAT = 5

trial('equal_data', NUMBER, REPEAT)
print("*"*50)
trial('unequal_data', NUMBER, REPEAT)
print("*"*50)
trial('extreme_data', NUMBER, REPEAT)