sum for sequences?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Mar 27 06:18:35 EDT 2010
On Fri, 26 Mar 2010 07:31:17 -0700, Steve Howell wrote:
> From a purely academic standpoint, I'm not convinced that sum() is
> inefficient in terms of big-O complexity, though.
>
> showell at showell-laptop:~$ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on
> linux2
> >>> class StupidList:
[...]
But it's not *sum* that is inefficient, it is sum *with a particular data
structure*.
Sure, if you create your own data structure, you can make it as efficient
as you like. Obviously the sum algorithm itself has to perform one
addition per item, or O(N), which scales tolerably well. But each
addition has a cost. If the cost is constant, then sum() as a whole
remains O(N). But if the cost of addition varies with N, sum() degrades
badly.
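To see why, here is a rough sketch of what sum() effectively does (the function slow_sum is my illustration, not CPython's actual code):

def slow_sum(items, start):
    # Essentially the loop inside sum(): one addition per item,
    # O(N) additions in total.
    total = start
    for item in items:
        # For lists, `total + item` builds a brand-new list, copying
        # every element accumulated so far.
        total = total + item
    return total

Summing N one-element lists therefore copies 1 + 2 + ... + N elements, about N**2/2 in all, so the O(N) additions hide O(N**2) work.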
We can compare the performance of sum with different data structures,
starting with plain integers versus long integers:
>>> from timeit import Timer
>>> setup = 'data = [%d]*%d'
>>> for i in range(6):
... t1 = Timer('sum(data, 0)', setup % (1, 10**i))
... t2 = Timer('sum(data, 0)', setup % (10**50, 10**i))
... print min(t1.repeat(number=1000)),
... print min(t2.repeat(number=1000))
...
0.00179290771484 0.00263810157776
0.00340414047241 0.00854396820068
0.0190401077271 0.0502791404724
0.155302047729 0.645124912262
0.794432878494 2.55748295784
7.97877693176 25.3812758923
Both scale about linearly, but the constant factor differs significantly: adding two k-digit longints takes time proportional to k, so arithmetic on very large longints is expensive even though the overall growth rate is the same.
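A quick way to see where the extra cost comes from (my addition, not part of the original timings): time a single addition as the operands grow. The per-addition cost rises with the digit count, while sum() still performs only N additions.

from timeit import Timer

for digits in (10, 100, 1000, 10000):
    t = Timer('x + y', 'x = y = 10**%d' % digits)
    # Expect roughly linear growth in the digit count once the
    # numbers are large.
    print(min(t.repeat(number=10000)))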
Next, strings versus lists. sum() flatly refuses a string start value (it raises TypeError and tells you to use ''.join instead), so a small trick is needed to fool it into accepting strings. Note that I changed the range from 6 down to 5, so the largest test has 10**4 items rather than 10**5; the reason why will become obvious:
>>> class EmptyStringStarter:
... def __add__(self, other):
... return other
...
>>> empty = EmptyStringStarter()
>>> setup = """from __main__ import empty; data = [%r]*%d"""
>>>
>>> for i in range(5):
... t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
... t2 = Timer('sum(data, [])', setup % ([1], 10**i))
... print min(t1.repeat(number=1000)),
... print min(t2.repeat(number=1000))
...
0.00849103927612 0.00226998329163
0.0121459960938 0.0082700252533
0.0489149093628 0.186735153198
0.428920030594 5.28623914719
14.6552250385 589.102822065
Strings perform tolerably well, up to a point, but lists perform terribly. In fact, the relatively good performance of strings is an artifact of recent versions of CPython; on Jython and IronPython, and on older versions of CPython, summing strings behaves as poorly as summing lists.
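For context (this reading is mine, not spelled out in the original post): presumably the artifact alluded to is CPython's special-casing of repeated string concatenation, where the interpreter may resize a string in place when the target holds the only reference to it, instead of copying it. Jython and IronPython make no such promise, which is why ''.join remains the portable idiom. The pattern the optimization targets looks like this:

def build(parts):
    # Relies on a CPython implementation detail: if `s` is the only
    # reference to its string, the concatenation may resize it in
    # place. On Jython/IronPython (and old CPythons) this loop is
    # quadratic, just like sum() over lists.
    s = ''
    for part in parts:
        s += part
    return s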
> I wonder how severely sum(), without
> the restriction, would underperform join() on modern versions of Python,
> though.
Take note that, in order to get an answer in reasonable time, I've
reduced the number of timing iterations drastically:
>>> for i in range(6):
... t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
... t2 = Timer('"".join(data)', setup % ('a', 10**i))
... print min(t1.repeat(number=10)),
... print min(t2.repeat(number=10))
...
8.89301300049e-05 1.09672546387e-05
0.000131845474243 2.19345092773e-05
0.000591993331909 9.29832458496e-05
0.0101289749146 0.00082802772522
0.365957021713 0.00884819030762
24.2072279453 0.0421011447906
Note the performance degradation of sum. It gets worse. Much worse:
>>> for i in range(4, 7):
... t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
... t2 = Timer('"".join(data)', setup % ('a', 10**i))
... print min(t1.repeat(number=1)), # note fewer iterations
... print min(t2.repeat(number=1))
...
0.031229019165 0.000817060470581
2.45445990562 0.00365781784058
1024.79705095 0.0398509502411
This is absolutely catastrophic performance degradation.
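The reason join wins is that it can measure all the pieces first, allocate the result once, and copy each character exactly once. A sketch of that strategy (an illustration of the idea, not CPython's actual implementation; it assumes the pieces are byte strings):

def join_sketch(parts):
    # Pass 1: measure the final size, so the result is allocated once.
    total = 0
    for p in parts:
        total += len(p)
    buf = bytearray(total)
    # Pass 2: copy each piece exactly once. Total work is proportional
    # to the combined length of the parts; nothing is ever recopied,
    # unlike the running total inside sum().
    pos = 0
    for p in parts:
        buf[pos:pos + len(p)] = p
        pos += len(p)
    return bytes(buf)

For example, join_sketch([b'spam', b'eggs']) returns b'spameggs' after exactly one allocation and eight bytes copied.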
--
Steven