# sum for sequences?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Mar 27 11:18:35 CET 2010

```On Fri, 26 Mar 2010 07:31:17 -0700, Steve Howell wrote:

> From a purely academic standpoint, I'm not convinced that sum() is
> inefficient in terms of big-O complexity, though.
>
>  showell at showell-laptop:~\$ python
>  Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on
>  linux2
>  >>> class StupidList:
[...]

But it's not *sum* that is inefficient, it is sum *with a particular data
structure*.

Sure, if you create your own data structure, you can make it as efficient
as you like. Obviously the sum algorithm itself has to perform one
addition per item, or O(N), which scales tolerably well. But each
addition has a cost. If the cost is constant, then sum() as a whole
remains O(N). But if the cost of addition varies with N, sum() degrades

We can compare the performance of sum with different data structures,
starting with plain integers versus long integers:

>>> from timeit import Timer
>>> setup = 'data = [%d]*%d'
>>> for i in range(6):
...     t1 = Timer('sum(data, 0)', setup % (1, 10**i))
...     t2 = Timer('sum(data, 0)', setup % (10**50, 10**i))
...     print min(t1.repeat(number=1000)),
...     print min(t2.repeat(number=1000))
...
0.00179290771484 0.00263810157776
0.00340414047241 0.00854396820068
0.0190401077271 0.0502791404724
0.155302047729 0.645124912262
0.794432878494 2.55748295784
7.97877693176 25.3812758923

Both scale about as well, but the cost varies significantly: arithmetic
on very large longints is expensive.

Strings, with a trick to fool sum into accepting them, versus lists. Note
that I changed the number of iterations from 6 down to 5. The reason why
will become obvious:

>>> class EmptyStringStarter:
...         return other
...
>>> empty = EmptyStringStarter()
>>> setup = """from __main__ import empty; data = [%r]*%d"""
>>>
>>> for i in range(5):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('sum(data, [])', setup % ([1], 10**i))
...     print min(t1.repeat(number=1000)),
...     print min(t2.repeat(number=1000))
...
0.00849103927612 0.00226998329163
0.0121459960938 0.0082700252533
0.0489149093628 0.186735153198
0.428920030594 5.28623914719
14.6552250385 589.102822065

Strings perform tolerably well, up to a point, but lists perform
terribly. And in fact, the relatively good performance of strings is an
older versions of CPython, it will behave as poorly as lists.

> I wonder how severely sum(), without
> the restriction, would underperform join() on modern versions of Python,
> though.

Take note that, in order to get an answer in reasonable time, I've
reduced the number of timing iterations drastically:

>>> for i in range(6):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('"".join(data)', setup % ('a', 10**i))
...     print min(t1.repeat(number=10)),
...     print min(t2.repeat(number=10))
...
8.89301300049e-05 1.09672546387e-05
0.000131845474243 2.19345092773e-05
0.000591993331909 9.29832458496e-05
0.0101289749146 0.00082802772522
0.365957021713 0.00884819030762
24.2072279453 0.0421011447906

Note the performance degradation of sum. It gets worse. Much worse:

>>> for i in range(4, 7):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('"".join(data)', setup % ('a', 10**i))
...     print min(t1.repeat(number=1)),  # note fewer iterations
...     print min(t2.repeat(number=1))
...
0.031229019165 0.000817060470581
2.45445990562 0.00365781784058
1024.79705095 0.0398509502411

This is absolutely catastrophic performance degradation.

--
Steven

```