[Python-ideas] Fast sum() for non-numbers

Joshua Landau joshua.landau.ws at gmail.com
Wed Jul 3 18:58:04 CEST 2013


On 3 July 2013 13:43, Steven D'Aprano <steve at pearwood.info> wrote:
> On 03/07/13 04:12, Sergey wrote:
>>
>> Hello, python-ideas. Trying to cover everything in one shot, so
>> the message is long, sorry.
>>
>> sum() is a great function. It is the "obvious way" to add things.
>> Unfortunately sometimes it's slower than it could be.
>>
>> The problem is that code:
>>    sum([[1,2,3]]*1000000, [])
>> takes forever to complete. Let's fix that!
>
>
> I'm not sure that sum() is the Obvious Way to concatenate lists, and I don't
> think that concatenating many lists is a common thing to do. Traditionally,
> sum() works only on numbers, and I think we wouldn't be having this
> discussion if Python used & for concatenation instead of +. So I don't care
> that sum() has quadratic performance on lists (and tuples), and I must admit
> that having a simple quadratic algorithm in the built-ins is sometimes
> useful for teaching purposes, so I'm -0 on optimizing this case.

This optimises every case where in-place addition (iadd) is faster than
plain add. It makes sense to me.
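To make the asymmetry concrete, here is a small sketch (mine, not the
proposed patch) of why += beats + when accumulating lists: list.__iadd__
extends the accumulator in place, while list.__add__ builds a brand-new
list and recopies everything accumulated so far on every step, giving the
quadratic behaviour described above.

```python
def concat_add(lists):
    # O(n^2): each + allocates a new list, copying the whole accumulator
    result = []
    for item in lists:
        result = result + item
    return result

def concat_iadd(lists):
    # O(n) overall: += extends the same list object in place
    result = []
    for item in lists:
        result += item
    return result

data = [[1, 2, 3]] * 1000
assert concat_add(data) == concat_iadd(data)
```

Both return the same result; only the running time differs, and the gap
grows quadratically with the number of lists.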

>> What I suggest is instead of making a copy for every item make just one
>> copy and reuse it as many times as needed. For example:
>>    def sum(seq, start = 0):
>>      start = start + seq[0]
>>      for item in seq[1:]:
>>        start += item
>>      return start
>
> [...]
>
>> Can this implementation break anything?
>
>
> That will still have quadratic performance for tuples, and it will break if
> seq is an iterator.
>
> It will also be expensive to make a slice of seq if it is a huge list, and
> could even run out of memory to hold both the original and the slice. I
> would be annoyed if sum started failing with a MemoryError here:
>
> big_seq = list(range(1000))*1000000  # assume this succeeds
> total = sum(big_seq)  # could fail

Then you can just make it support iterators:

def sum(seq, start=0):
  seq = iter(seq)
  try:
    start = start + next(seq)
  except StopIteration:
    return start  # empty input: behave like builtin sum()
  for item in seq:
    start += item
  return start
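One subtlety worth spelling out (my reading of the version above): the
first step deliberately uses start + next(seq) rather than +=, so the
caller's start object is never mutated; only the fresh object produced by
that first addition is then extended in place. A small sketch, using a
hypothetical helper name so it does not shadow the builtin:

```python
def sum_inplace(seq, start=0):
    # Mirrors the iterator-based sum above.
    seq = iter(seq)
    try:
        start = start + next(seq)  # fresh object; caller's start untouched
    except StopIteration:
        return start
    for item in seq:
        start += item  # safe: mutates only our private accumulator
    return start

base = []
result = sum_inplace([[1], [2, 3]], base)
assert result == [1, 2, 3]
assert base == []  # the caller's start list was not modified
```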


>> Alternatives to sum() for this use-case:
>> * list comprehension is 270% slower than patched sum [2]
>> * itertools.chain is 50%-100% slower than patched sum [3]
>
>
> How about the "Obvious Way" to concatenate lists?
>
> new = []
> for x in seq:
>     new.extend(x)

What's wrong with sum, if sum is fast?
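For reference, the two linear-time alternatives mentioned in this thread,
the quoted extend loop and itertools.chain, can be sketched side by side:

```python
import itertools

data = [[1, 2, 3]] * 1000

# The "Obvious Way" quoted above: build one list, extend it in place.
new = []
for x in data:
    new.extend(x)

# The itertools.chain alternative.
chained = list(itertools.chain.from_iterable(data))

assert new == chained == [1, 2, 3] * 1000
```

Both avoid the quadratic recopying; the question in this thread is only
whether sum() itself should match them.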


> -0 on the general idea, -1 on the specific implementation. I'd rather have
> sum of lists be slow than risk sum of numbers raise MemoryError.

How about with my version? I see no obvious downsides, personally.

