[Tutor] Better way to insert items into a list

Steven D'Aprano steve at pearwood.info
Sun Dec 9 02:53:49 CET 2012


On 09/12/12 10:45, Mike wrote:
> Thank you.  Works well.

Actually, no, it doesn't work well. You only think it works well because
you've only tested it with tiny amounts of data.


> On Sat, Dec 8, 2012 at 4:18 PM, bob gailer<bgailer at gmail.com>  wrote:
>> tuple(sum([list(x) for x in zip(a,[c]*len(a))],[]))


If you test that with large amounts of data, you will notice that this
is horribly, painfully slow and inefficient.

Here, I set up some sample data for testing:


py> from string import ascii_letters
py> input = []
py> for a in ascii_letters:
...     for b in ascii_letters:
...             input.append(a+b)
...
py> input = input*10
py> len(input)
27040
py> print input[:5], input[-5:]
['aa', 'ab', 'ac', 'ad', 'ae'] ['ZV', 'ZW', 'ZX', 'ZY', 'ZZ']


Now I time how long it takes to interleave values using the two general
purpose functions I gave earlier:


py> with Timer():
...     t = tuple(interleave(input, ['00']*len(input)))
...
time taken: 0.033014 seconds
py> with Timer():
...     t = tuple(longmux(input, [], fillvalue='00'))
...
time taken: 0.019947 seconds


So on my old and not terribly fast computer, we should be able to process
at least 1.3 million items per second. What does the solution using sum
do?

py> with Timer():
...     t = tuple(sum([list(x) for x in zip(input, ['00']*len(input))], []))
...
time taken: 4.607549 seconds


That is not a mistake -- it is over two hundred times slower, processing
fewer than 6000 items per second. And it actually gets slower as you use
more items.

So, what's going on here? The problem is that sum() adds lists, and repeatedly
adding lists is slow. Suppose we wanted to add four lists, each with two items:

[a, b] + [c, d] + [e, f] + [g, h]

Now, you or I would simply create one new list and append each item in turn:

[a, b] + [c, d] + [e, f] + [g, h]
=> [a, b, c, d, e, f, g, h]

so we would need to visit each item once. Since there are four lists with two
items each, we end up making *one* new list with *eight* items.

But that's not what Python can do. Python can only inspect one pair of lists
at a time. So what it ends up doing is this:

[a, b] + [c, d] + [e, f] + [g, h]
=> [a, b, c, d] + [e, f] + [g, h]  one new list with four items
=> [a, b, c, d, e, f] + [g, h]     second new list with six items
=> [a, b, c, d, e, f, g, h]        third new list with eight items

which gives, all up, 18 items that need to be touched. That's more than double
the number of items in the final list.

As the number of lists gets bigger, the amount of excess work increases even
more. With 100 lists being added, each of two items, the amount of work
Python has to do is enormous:

Ninety-eight intermediate lists are created, plus the final list which is
kept, and (4+6+8+...+198+200) items = 10100 items touched in total.

Except perhaps for a once-off, throw-away script, you should never use sum()
for adding lists, tuples, or strings. In fact, Python prohibits you from using
sum with string arguments:

py> sum("aaa".split(), "")
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]


specifically to avoid this same sort of painfully slow code.



If anyone is interested in the code for Timer() that I use above, you can
find it here:

http://code.activestate.com/recipes/577896




-- 
Steven


More information about the Tutor mailing list