[Python-ideas] Fast sum() for non-numbers

Wed Jul 3 05:24:02 CEST 2013

Such a function is very tiny:

 >>> import operator
 >>> isum = lambda *args: reduce(operator.iadd,*args)

But this might be unexpected:

 >>> l = []
 >>> l2 = isum([[1,2,3]]*1000000, l)

l is now changed. In fact l == l2. I guess this could cause more troubles than
it's good for if the user expects tha current behaviour. I don't think such an
incompatible API change can ever be made. But one could maybe include isum,
maybe just as recipe in the documentation or in itertools or somewhere.

On 07/02/2013 08:12 PM, Sergey wrote:
> Hello, python-ideas. Trying to cover everything in one shot, so
> the message is long, sorry.
>
> sum() is a great function. It is the "obvious way" to add things.
> Unfortunately sometimes it's slower than it could be.
>
> The problem is that code:
>    sum([[1,2,3]]*1000000, [])
> takes forever to complete. Let's fix that!
>
> This problem was there since the sum() introduction, but it is almost
> unknown among python-beginners.
>
> When people look at sum(seq,start=0) signature they most probably
> expect it to be like this:
>    def sum(seq, start = 0):
>      for item in seq:
>        start += item
>      return start
>
> But it is not, because in cases like:
>    empty = []
>    result = sum(list_of_lists, empty)
> such implementation would modify content of "empty". This use-case is
> known, it is checked in test_builtin.py, and explicitly mentioned in
> comments inside sum() code.
>
> So actually sum() looks like this:
>    def sum(seq, start = 0):
>      for item in seq:
>        start = start + item
>      return start
> it creates a copy of the partial result on every "start + item".
>
> What I suggest is instead of making a copy for every item make just one
> copy and reuse it as many times as needed. For example:
>    def sum(seq, start = 0):
>      start = start + seq[0]
>      for item in seq[1:]:
>        start += item
>      return start
>
> Patch implementing this idea attached to issue18305. Patch is simple, it
> just rearranges existing code. It should work for python 2.7, 3.3 and
> hg-tip. Except sum() becoming faster there should be no behavior change.
> Can this implementation break anything?
>
> Advantages:
> * Performance boost is like 200000% [1], nothing else becomes slower
>
> Disadvantages (inspired by Terry J. Reedy):
> * Other pythons (pypy, jython, older cpython versions) do not have
>    such optimisation, people will move code that depends on the internal
>    optimization to pythons that do not have it. And they will complain
>    about their program 'freezing'.
>
> * It discourages people from carefully thinking about whether they
>    actually need a concrete list or merely the iterator for a virtual
>    list.
>
> Alternatives to sum() for this use-case:
> * list comprehension is 270% slower than patched sum [2]
> * itertools.chain is 50%-100% slower than patched sum [3]
>
> Main questions:
> * Whether to accept the patch
> * If yes, whether it should go to the new version or to a bugfix release
>
> Alternatives to this patch:
> * Reject the idea as a whole. Intentionally keep the sum() slow.
> * Accept the idea, but reject this particular implementation, instead use
>    something different, e.g. implement special case optimization or use
>    different approach (I thought about start=copy.copy(start))
>
> In my opinion (looking though python changelog) performance patches were
> accepted for bugfix releases many times before. So as long as this patch
> does not break anything, it may go even to python 2.7.
>
> I think bad performance is a bug that should be fixed. Rejecting
> performance fixes just because older versions don't have it would mean
> no performance improvements to anything, ever.
>
> Sum is a great basic function. It keeps the simple code simple. This
> patch puts it on par with more advanced features like itertools, and
> allows people to use sum for simple things and itertools for complex
> memory-efficient algorithms.
>
> Questions that may arise if the patch is accepted:
> * sum() was rejecting strings because of this bug. If the bug gets fixed
>    should another patch allow sum() to accept strings?
> * maybe in some distant future drop the second parameter (or make it
>    None by default) and allow calling sum for everything, making sum()
>    "the one obvious way" to sum up things?
>
> It would be nice if sum "just worked" for everything (e.g. sum() of
> empty sequence would return None, i.e. if there's nothing to sum then
> nothing is returned). But I think it needs more work for that, because
> even with this patch sum() is still ~20 times slower than "".join() [4]
>
> That's all. Any suggestions welcome.
>