[Python-ideas] Fast sum() for non-numbers
Andrew Barnert
abarnert at yahoo.com
Wed Jul 3 06:40:23 CEST 2013
On Jul 2, 2013, at 20:24, Mathias Panzenböck <grosser.meister.morti at gmx.net> wrote:
> Such a function is very tiny:
>
> >>> import operator
> >>> isum = lambda *args: reduce(operator.iadd,*args)
>
> But this might be unexpected:
>
> >>> l = []
> >>> l2 = isum([[1,2,3]]*1000000, l)
>
> l is now changed. In fact l == l2.
He explicitly suggested making one copy, before looping. So, this:

    sum([[1,2,3]]*1000000, l)

has to mean:

    reduce(operator.iadd, [[1,2,3]]*1000000, copy.copy(l))

So, it's not quite a one-liner, and it doesn't have this problem.
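To make the semantics concrete, here's a quick sketch of that spelling (the example values are mine, not from the patch):

    import copy
    import operator
    from functools import reduce  # builtin in 2.x, lives in functools on 3.x

    l = []
    result = reduce(operator.iadd, [[1, 2, 3]] * 1000, copy.copy(l))

    print(l)            # [] -- the start value is untouched
    print(len(result))  # 3000 -- everything accumulated in the copy
    print(result is l)  # False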
> But one could maybe include isum,
> maybe just as a recipe in the documentation or in itertools or somewhere.
This sounds like a good idea. Possibly even both versions, with and without the copy.
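Something like this, perhaps (a rough sketch of what such a recipe could look like; the csum name is my own invention):

    import copy
    import operator
    from functools import reduce

    def isum(iterable, start):
        """In-place sum: fast, but mutates (and returns) start."""
        return reduce(operator.iadd, iterable, start)

    def csum(iterable, start):
        """Copying sum: shallow-copies start once, then adds in place."""
        return reduce(operator.iadd, iterable, copy.copy(start))

The second one is the behavior sum() would need in order to keep its current promise of not mutating the start value.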
While we're at it, I've always wished sum and reduce used the same name for their start/initial parameter. If we're going to be changing one (which I suspect we probably aren't, but just in case...), is this a good time to campaign for renaming the param?
> On 07/02/2013 08:12 PM, Sergey wrote:
>> Hello, python-ideas. Trying to cover everything in one shot, so
>> the message is long, sorry.
>>
>> sum() is a great function. It is the "obvious way" to add things.
>> Unfortunately sometimes it's slower than it could be.
>>
>> The problem is that this code:
>>     sum([[1,2,3]]*1000000, [])
>> takes forever to complete. Let's fix that!
>>
>> This problem has been there since sum() was introduced, but it is almost
>> unknown among Python beginners.
>>
>> When people look at the sum(seq, start=0) signature they most probably
>> expect it to be implemented like this:
>>     def sum(seq, start=0):
>>         for item in seq:
>>             start += item
>>         return start
>>
>> But it is not, because in cases like:
>>     empty = []
>>     result = sum(list_of_lists, empty)
>> such an implementation would modify the contents of "empty". This use-case
>> is known, it is checked in test_builtin.py, and it is explicitly mentioned
>> in comments inside the sum() code.
>>
>> So actually sum() looks like this:
>>     def sum(seq, start=0):
>>         for item in seq:
>>             start = start + item
>>         return start
>> it creates a copy of the partial result on every "start + item".
>>
>> What I suggest is, instead of making a copy for every item, to make just
>> one copy and reuse it as many times as needed. For example:
>>     def sum(seq, start=0):
>>         start = start + seq[0]
>>         for item in seq[1:]:
>>             start += item
>>         return start
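For anyone who wants to see the effect being described here without building the patch, a rough pure-Python approximation of the two behaviours (the sizes and names are mine, not the benchmark from the footnotes):

    import copy
    import operator
    import timeit
    from functools import reduce

    data = [[1, 2, 3]] * 10000
    start = []

    # Current behaviour: start = start + item copies the partial result each time.
    quadratic = lambda: reduce(operator.add, data, start)

    # Proposed behaviour: copy start once, then extend it in place.
    linear = lambda: reduce(operator.iadd, data, copy.copy(start))

    print(timeit.timeit(quadratic, number=1))
    print(timeit.timeit(linear, number=1))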
>>
>> A patch implementing this idea is attached to issue18305. The patch is
>> simple; it just rearranges existing code. It should work for Python 2.7,
>> 3.3 and hg-tip. Apart from sum() becoming faster, there should be no
>> behavior change. Can this implementation break anything?
>>
>> Advantages:
>> * The performance boost is around 200000% [1], and nothing else becomes slower
>>
>> Disadvantages (inspired by Terry J. Reedy):
>> * Other Pythons (PyPy, Jython, older CPython versions) do not have such
>> an optimization, so people will move code that depends on this internal
>> optimization to Pythons that do not have it, and then complain about
>> their programs "freezing".
>>
>> * It discourages people from carefully thinking about whether they
>> actually need a concrete list or merely the iterator for a virtual
>> list.
>>
>> Alternatives to sum() for this use-case:
>> * list comprehension is 270% slower than patched sum [2]
>> * itertools.chain is 50%-100% slower than patched sum [3]
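For context, I assume the alternatives being benchmarked are spelled roughly like this (my guess at the forms; the exact code is in the footnotes, which are not reproduced here):

    import itertools

    data = [[1, 2, 3]] * 1000000

    # Flattening with a list comprehension
    flat1 = [x for sublist in data for x in sublist]

    # Flattening with itertools.chain
    flat2 = list(itertools.chain.from_iterable(data))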
>>
>> Main questions:
>> * Whether to accept the patch
>> * If yes, whether it should go to the new version or to a bugfix release
>>
>> Alternatives to this patch:
>> * Reject the idea as a whole. Intentionally keep sum() slow.
>> * Accept the idea, but reject this particular implementation; instead use
>> something different, e.g. a special-case optimization or another
>> approach (I thought about start=copy.copy(start))
>>
>> In my opinion (looking through the Python changelog) performance patches
>> have been accepted for bugfix releases many times before. So as long as
>> this patch does not break anything, it could go even into Python 2.7.
>>
>> I think bad performance is a bug that should be fixed. Rejecting
>> performance fixes just because older versions don't have them would mean
>> no performance improvements to anything, ever.
>>
>> Sum is a great basic function. It keeps simple code simple. This
>> patch puts it on par with more advanced features like itertools, and
>> allows people to use sum for simple things and itertools for complex
>> memory-efficient algorithms.
>>
>> Questions that may arise if the patch is accepted:
>> * sum() was rejecting strings because of this bug. If the bug gets fixed,
>> should another patch allow sum() to accept strings?
>> * maybe in some distant future drop the second parameter (or make it
>> None by default) and allow calling sum for everything, making sum()
>> "the one obvious way" to sum up things?
>>
>> It would be nice if sum "just worked" for everything (e.g. sum() of an
>> empty sequence would return None, i.e. if there's nothing to sum then
>> nothing is returned). But I think that needs more work, because even
>> with this patch sum() is still ~20 times slower than "".join() [4]
>>
>> That's all. Any suggestions welcome.
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas