[Python-ideas] Fast sum() for non-numbers
Andrew Barnert
abarnert at yahoo.com
Wed Jul 3 06:40:23 CEST 2013
On Jul 2, 2013, at 20:24, Mathias Panzenböck <grosser.meister.morti at gmx.net> wrote:
> Such a function is very tiny:
>
> >>> import operator
> >>> isum = lambda *args: reduce(operator.iadd,*args)
>
> But this might be unexpected:
>
> >>> l = []
> >>> l2 = isum([[1,2,3]]*1000000, l)
>
> l is now changed. In fact l == l2.
He explicitly suggested making one copy, before looping. So, this:

    sum([[1,2,3]]*1000000, l)

has to mean:

    reduce(operator.iadd, [[1,2,3]]*1000000, copy.copy(l))

So, it's not quite a one-liner, and it doesn't have this problem.
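To make the semantics concrete, here's a quick sketch of that spelling (the example values are mine, not from the patch):

    import copy
    import operator
    from functools import reduce  # builtin in 2.x, lives in functools on 3.x

    l = []
    result = reduce(operator.iadd, [[1, 2, 3]] * 1000, copy.copy(l))

    print(l)            # [] -- the start value is untouched
    print(len(result))  # 3000 -- everything accumulated in the copy
    print(result is l)  # False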
> But one could maybe include isum,
> maybe just as a recipe in the documentation or in itertools or somewhere.
This sounds like a good idea. Possibly even both versions, with and without the copy.
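Something like this, perhaps (a rough sketch of what such a recipe could look like; the csum name is my own invention):

    import copy
    import operator
    from functools import reduce

    def isum(iterable, start):
        """In-place sum: fast, but mutates (and returns) start."""
        return reduce(operator.iadd, iterable, start)

    def csum(iterable, start):
        """Copying sum: shallow-copies start once, then adds in place."""
        return reduce(operator.iadd, iterable, copy.copy(start))

The second one is the behavior sum() would need in order to keep its current promise of not mutating the start value.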
While we're at it, I've always wished sum and reduce used the same name for their start/initial parameter. If we're going to be changing one (which I suspect we probably aren't, but just in case...), is this a good time to campaign for renaming the param?
> On 07/02/2013 08:12 PM, Sergey wrote:
>> Hello, python-ideas. Trying to cover everything in one shot, so
>> the message is long, sorry.
>>
>> sum() is a great function. It is the "obvious way" to add things.
>> Unfortunately sometimes it's slower than it could be.
>>
>> The problem is that this code:
>>     sum([[1,2,3]]*1000000, [])
>> takes forever to complete. Let's fix that!
>>
>> This problem has been there since sum() was introduced, but it is almost
>> unknown among Python beginners.
>>
>> When people look at the sum(seq, start=0) signature they most probably
>> expect it to be implemented like this:
>>     def sum(seq, start=0):
>>         for item in seq:
>>             start += item
>>         return start
>>
>> But it is not, because in cases like:
>>     empty = []
>>     result = sum(list_of_lists, empty)
>> such an implementation would modify the contents of "empty". This use-case
>> is known, it is checked in test_builtin.py, and it is explicitly mentioned
>> in comments inside the sum() code.
>>
>> So actually sum() looks like this:
>>     def sum(seq, start=0):
>>         for item in seq:
>>             start = start + item
>>         return start
>> it creates a copy of the partial result on every "start + item".
>>
>> What I suggest is, instead of making a copy for every item, to make just
>> one copy and reuse it as many times as needed. For example:
>>     def sum(seq, start=0):
>>         start = start + seq[0]
>>         for item in seq[1:]:
>>             start += item
>>         return start
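For anyone who wants to see the effect being described here without building the patch, a rough pure-Python approximation of the two behaviours (the sizes and names are mine, not the benchmark from the footnotes):

    import copy
    import operator
    import timeit
    from functools import reduce

    data = [[1, 2, 3]] * 10000
    start = []

    # Current behaviour: start = start + item copies the partial result each time.
    quadratic = lambda: reduce(operator.add, data, start)

    # Proposed behaviour: copy start once, then extend it in place.
    linear = lambda: reduce(operator.iadd, data, copy.copy(start))

    print(timeit.timeit(quadratic, number=1))
    print(timeit.timeit(linear, number=1))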
>>
>> A patch implementing this idea is attached to issue18305. The patch is
>> simple; it just rearranges existing code. It should work for Python 2.7,
>> 3.3 and hg-tip. Apart from sum() becoming faster, there should be no
>> behavior change. Can this implementation break anything?
>>
>> Advantages:
>> * The performance boost is around 200000% [1], and nothing else becomes slower
>>
>> Disadvantages (inspired by Terry J. Reedy):
>> * Other Pythons (PyPy, Jython, older CPython versions) do not have such
>> an optimization, so people will move code that depends on this internal
>> optimization to Pythons that do not have it, and then complain about
>> their programs "freezing".
>>
>> * It discourages people from carefully thinking about whether they
>> actually need a concrete list or merely the iterator for a virtual
>> list.
>>
>> Alternatives to sum() for this use-case:
>> * list comprehension is 270% slower than patched sum [2]
>> * itertools.chain is 50%-100% slower than patched sum [3]
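For context, I assume the alternatives being benchmarked are spelled roughly like this (my guess at the forms; the exact code is in the footnotes, which are not reproduced here):

    import itertools

    data = [[1, 2, 3]] * 1000000

    # Flattening with a list comprehension
    flat1 = [x for sublist in data for x in sublist]

    # Flattening with itertools.chain
    flat2 = list(itertools.chain.from_iterable(data))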
>>
>> Main questions:
>> * Whether to accept the patch
>> * If yes, whether it should go to the new version or to a bugfix release
>>
>> Alternatives to this patch:
>> * Reject the idea as a whole. Intentionally keep sum() slow.
>> * Accept the idea, but reject this particular implementation; instead use
>> something different, e.g. a special-case optimization or another
>> approach (I thought about start=copy.copy(start))
>>
>> In my opinion (looking through the Python changelog) performance patches
>> have been accepted for bugfix releases many times before. So as long as
>> this patch does not break anything, it could go even into Python 2.7.
>>
>> I think bad performance is a bug that should be fixed. Rejecting
>> performance fixes just because older versions don't have them would mean
>> no performance improvements to anything, ever.
>>
>> Sum is a great basic function. It keeps simple code simple. This
>> patch puts it on par with more advanced features like itertools, and
>> allows people to use sum for simple things and itertools for complex
>> memory-efficient algorithms.
>>
>> Questions that may arise if the patch is accepted:
>> * sum() was rejecting strings because of this bug. If the bug gets fixed,
>> should another patch allow sum() to accept strings?
>> * maybe in some distant future drop the second parameter (or make it
>> None by default) and allow calling sum for everything, making sum()
>> "the one obvious way" to sum up things?
>>
>> It would be nice if sum "just worked" for everything (e.g. sum() of an
>> empty sequence would return None, i.e. if there's nothing to sum then
>> nothing is returned). But I think that needs more work, because even
>> with this patch sum() is still ~20 times slower than "".join() [4]
>>
>> That's all. Any suggestions welcome.
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas