sum() vs. loop

Wed Oct 13 12:49:55 EDT 2021

On Mon, 11 Oct 2021 at 23:00, Christian Gollwitzer <auriocus at gmx.de> wrote:
>
> Am 10.10.21 um 10:49 schrieb Steve Keller:
> > I have found the sum() function to be much slower than to loop over the
> > operands myself:
> >
> > def sum_products(seq1, seq2):
> >      return sum([a * b for a, b in zip(seq1, seq2)])
> >
> > def sum_products2(seq1, seq2):
> >      sum = 0
> >      for a, b in zip(seq1, seq2):
> >          sum += a * b
> >      return sum
> >
> > In a program I generate about 14 million pairs of sequences of ints each
> > of length 15 which need to be summed.  The first version with sum() needs
> > 44 seconds while the second version runs in 37 seconds.
>
> The first version constructs a list, sums it up and throws the list
> away, while the second version only keeps the running sum in memory. How
> about a generator expression instead, i.e.
>
>
>              sum((a * b for a, b in zip(seq1, seq2)))

What really matters in cases like this is interpreter overhead.
Typically a generator expression has slightly more overhead compared
to a list comprehension. You get a slightly slower per-item speed in
exchange for O(1) memory consumption.

A list comprehension like

result = sum([expr for foo in bar])

Translates internally to something like:

def _func():
    data = []
    for foo in bar:
        data.append(foo)
    return foo

_stuff = _func()
result = sum(_stuff)

Running this really does bring in the overhead of a function call
_func() because it needs to create a frame with locals and so on.
However it is only the cost of a single function call. Afterwards if
the elements in the list _stuff are built in types like int etc with
builtin __add__ methods then sum(_stuff) does not involve the
interpreter at all: it's a built-in function operating on a built-in
container of built-in objects.

With a generator expression like

result = sum(expr for foo in bar)

This translates internally to

def _genfunc():
    for foo in bar:
        yield foo

_gen = _genfunc()
result = sum(_gen)

Now the _genfunc() function call simply creates the generator and sets
up its frame which I think is more or less equivalent to the overhead
of the _func() function call. However the sum(_gen) line is no longer
a pure built-in operation. Internally each time sum calls
_gen.__next__() the execution frame for _genfunc() is reactivated so
that the interpreter can execute the bytecodes taking the frame from
one yield to the next. This is almost the overhead of a function call
for each item in bar although this is often unnoticeable in practice
and other factors can affect this.

The fastest version eliminates the interpreter altogether at least
when operating on pure built-in elements:

In [7]: nums1 = nums2 = list(range(10**5))

In [10]: %timeit sum([a*b for a, b in zip(nums1, nums2)])
9.83 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit sum((a*b for a, b in zip(nums1, nums2)))
10.3 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: from operator import mul

In [13]: %timeit sum(map(mul, nums1, nums2))
7.25 ms ± 33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Of course if the elements being multiplied and summed have pure-Python
__add__/__mul__ methods or the container has a pure Python
__iter__/__next__ then any of those methods will typically dominate
the runtime over any of the overheads considered here.

--
Oscar