sum for sequences?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Mar 27 06:18:35 EDT 2010


On Fri, 26 Mar 2010 07:31:17 -0700, Steve Howell wrote:

> From a purely academic standpoint, I'm not convinced that sum() is
> inefficient in terms of big-O complexity, though.
> 
>  showell at showell-laptop:~$ python
>  Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on
>  linux2
>  >>> class StupidList:
[...]

But it's not *sum* that is inefficient; it is sum *with a particular 
data structure*.

Sure, if you create your own data structure, you can make it as efficient 
as you like. Obviously the sum algorithm itself has to perform one 
addition per item, or O(N), which scales tolerably well. But each 
addition has a cost. If that cost is constant, then sum() as a whole 
remains O(N). But if the cost of each addition grows with N, sum() 
degrades badly.
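
To make this concrete, here is a rough pure-Python sketch of the work 
sum() does (my sketch, not sum's actual C implementation):

def sum_sketch(items, start):
    # One addition per item: O(N) additions in total.
    total = start
    for item in items:
        # The overall cost is N times the cost of this +. For small
        # ints that cost is (nearly) constant; for lists and strings,
        # + copies the entire accumulated result, so step k costs
        # O(k) and the loop as a whole degrades to O(N**2).
        total = total + item
    return total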

We can compare the performance of sum with different data structures, 
starting with plain integers versus long integers:

>>> from timeit import Timer
>>> setup = 'data = [%d]*%d'
>>> for i in range(6):
...     t1 = Timer('sum(data, 0)', setup % (1, 10**i))
...     t2 = Timer('sum(data, 0)', setup % (10**50, 10**i))
...     print min(t1.repeat(number=1000)),
...     print min(t2.repeat(number=1000))
...
0.00179290771484 0.00263810157776
0.00340414047241 0.00854396820068
0.0190401077271 0.0502791404724
0.155302047729 0.645124912262
0.794432878494 2.55748295784
7.97877693176 25.3812758923

Both scale about equally well, but the constant factor differs 
significantly: arithmetic on very large longints is expensive.
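
That's because longint addition is not constant time: adding two 
longints costs time proportional to the number of digits. To see the 
operand-size effect in isolation, separated from sum's loop, something 
like this untested sketch would do it:

from timeit import Timer
for digits in (10, 100, 1000, 10000):
    # A single addition of two n-digit longints is O(n), so the
    # per-addition cost keeps growing as the operands get bigger.
    t = Timer('x + x', 'x = 10**%d' % digits)
    print min(t.repeat(number=100000))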

Now strings, with a trick to fool sum into accepting them (sum() refuses 
a str start value, so we pass it an object whose __add__ simply returns 
the other operand), versus lists. Note that I changed the outer loop 
from range(6) down to range(5); the reason why will become obvious:

>>> class EmptyStringStarter:
...     def __add__(self, other):
...         return other
... 
>>> empty = EmptyStringStarter()
>>> setup = """from __main__ import empty; data = [%r]*%d"""
>>> 
>>> for i in range(5):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('sum(data, [])', setup % ([1], 10**i))
...     print min(t1.repeat(number=1000)),
...     print min(t2.repeat(number=1000))
...
0.00849103927612 0.00226998329163
0.0121459960938 0.0082700252533
0.0489149093628 0.186735153198
0.428920030594 5.28623914719
14.6552250385 589.102822065


Strings perform tolerably well, up to a point, but lists perform 
terribly. And in fact, the relatively good performance of strings is an 
artifact of recent versions of CPython. In Jython and IronPython, and in 
older versions of CPython, strings will behave as poorly as lists.
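
The cure for lists is the same as for strings: stop copying the 
accumulated result on every addition. Two standard, linear-time 
alternatives (a sketch; both are O(total length)):

from itertools import chain

def flatten(list_of_lists):
    # extend() grows one result list in place instead of building a
    # brand new list per +, so the total work is proportional to the
    # combined length of the inputs.
    result = []
    for sub in list_of_lists:
        result.extend(sub)
    return result

# or equivalently, in one line:
#     list(chain.from_iterable(list_of_lists))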


> I wonder how severely sum(), without
> the restriction, would underperform join() on modern versions of Python,
> though.


Take note that, in order to get an answer in reasonable time, I've 
reduced the number of timing iterations drastically, from 1000 down to 10:

>>> for i in range(6):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('"".join(data)', setup % ('a', 10**i))
...     print min(t1.repeat(number=10)),
...     print min(t2.repeat(number=10))
...
8.89301300049e-05 1.09672546387e-05
0.000131845474243 2.19345092773e-05
0.000591993331909 9.29832458496e-05
0.0101289749146 0.00082802772522
0.365957021713 0.00884819030762
24.2072279453 0.0421011447906

Note the performance degradation of sum. It gets worse. Much worse:

>>> for i in range(4, 7):
...     t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
...     t2 = Timer('"".join(data)', setup % ('a', 10**i))
...     print min(t1.repeat(number=1)),  # note fewer iterations
...     print min(t2.repeat(number=1))
...
0.031229019165 0.000817060470581
2.45445990562 0.00365781784058
1024.79705095 0.0398509502411

This is absolutely catastrophic performance degradation.
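
The reason join stays so flat is that it can measure the result first 
and then copy each character exactly once. A rough pure-Python model of 
that strategy (just to illustrate the idea; CPython's str.join is 
implemented in C):

def join_model(strings):
    # Two passes: measure, then copy. Every byte is written exactly
    # once into a buffer allocated up front, so the whole operation
    # is O(total length), while repeated + re-copies the accumulated
    # prefix on every single addition.
    total = sum(len(s) for s in strings)
    buf = bytearray(total)
    pos = 0
    for s in strings:
        buf[pos:pos + len(s)] = s
        pos += len(s)
    return str(buf)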



-- 
Steven


