itertools.flatten()? and copying generators/iterators.

Tue Oct 28 09:19:40 EST 2003

Francis Avila wrote:
   ...
> Actually, I meant the exception testing. Isn't Alex suggesting something
> like?:
> 
> try:
>     seq+''
> except:
>     seq is not a string
> else:
>     seq is a string

Right.  Yes, if you have a lot of nonstrings this IS going to be
substantially slower than isinstance tests.  As usual, setting up
a representative benchmark and measuring the effect is best.

Of course, it's not easy to decide what IS representative, but
let's give it a try:

# Peter Otten's suggestion (simplified/flattened out)
def flatten_po(s):
    try:
        if isinstance(s, basestring):
            raise TypeError
        else:
            it = iter(s)
    except TypeError:
        yield s
    else:
        for elem in it:
            for subelem in flatten_po(elem):
                yield subelem

# my slight preference
def flatten_am(s):
    try:
        s+''
    except TypeError:
        try: it = iter(s)
        except TypeError:
            yield s
        else:
            for elem in s:
                for subelem in flatten_am(elem):
                    yield subelem
    else:
        yield s

# an example tree -- a mix of strings, items, some nesting
subtree = ['foo']*18 + [1,2,3]*6
tree = [ subtree*10, [ subtree * 8 ] ]

# functions to benchmark
def process(flatten, tree=tree):
    return list(flatten(tree))

def pro_po(): return process(flatten_po)
def pro_am(): return process(flatten_am)

Now, with this setup, I measure:

[alex at lancelot bo]$ timeit.py -c -s'import fl' 'fl.pro_po()'
100 loops, best of 3: 9.4e+03 usec per loop
[alex at lancelot bo]$ timeit.py -c -s'import fl' 'fl.pro_am()'
100 loops, best of 3: 8.3e+03 usec per loop

...which is a reminder of WHY it's best to always measure --
I fully expected pro_am to be slower, as it's more general
(handles UserString instances etc), and I was just trying
to check _how_ much slower it was...!

Upon reflection, the problem is with flatten_po elegant
'raise TypeError', which merges strings with noniterables
BUT only at the cost of an exception.  Changing the
preample of flatten_po to:

    try:
        if isinstance(s, basestring):
            yield s
            return
        ...

improves its timeit.py-measured performance to 5.9e+03 usec.
So, on this sample of data, the _po version is either 13%
slower or 29% faster than the _am .

Now, fortunately, it's an EXTREMELY unlikely event that
performance differences of this order of magnitude should
deeply influence our choice of algorithms!  Twice as slow,
well, one wants to keep an eye on that; 30% or less, it's
unlikely to matter except at the very heart of some key
bottleneck -- and, honestly, how often will flattening
overhead be THE bottleneck of your application...?  Note
that in this case we're basically doing no processing with
the items in the flattened sequence, just stuffing them into
flat lists.  Do any processing whatsoever, and that will
probably dwarf the infrastructural overhead of the flattening
procedure itself.

The key thing is watching for difference in big-O behavior --
and, here, we have none... just somewhat different (NOT
by whole orders of magnitude!) multiplicative constants.
THAT kind of performance issue is one we can live with
more often than not!  Which, in turn, is what makes the
ordinary use of exceptions generally tolerable from the
point of view of performance, except in the very hottest
bottlenecks...

Alex