[Python-Dev] Fwd: summing a bunch of numbers (or "whatevers")

Alex Martelli aleax@aleax.it
Mon, 21 Apr 2003 12:52:32 +0200

On Monday 21 April 2003 10:52 am, Alex Martelli wrote:
> Aye aye, cap'n -- now that youve crossed the i's and dotted the t's
> I'll arrange the complete patch with tests and docs and submit it
> forthwith.

Done -- patch 724936 on SF, assigned to gvanrossum with priority 7
as you said to do for patches meant for 2.3beta1.

As I've remarked in the patch's comments, there's something of a
performance hit now with sum(manystrings) wrt ''.join(manystrings):

[alex@lancelot Lib]$ ../python -O timeit.py -s'L=map(str,range(999))' 
-s'import operator'  'sum(L)'
10000 loops, best of 3: 174 usec per loop

[alex@lancelot Lib]$ ../python -O timeit.py -s'L=map(str,range(999))' 
-s'import operator'  '"".join(L)'
10000 loops, best of 3: 75 usec per loop

[alex@lancelot Lib]$ ../python -O timeit.py -s'L=map(str,range(999))' 
-s'import operator'  'reduce(operator.add,L)'
1000 loops, best of 3: 1.35e+03 usec per loop

[alex@lancelot Lib]$ ../python -O timeit.py -s'L=map(str,range(999))' 
-s'import operator'  'tot=""' 'for it in L: tot+=it'
1000 loops, best of 3: 1.33e+03 usec per loop

Nowhere as bad as the unbounded slowdown with operator.add
or the equivalent loop, but still, a solid slowdown of a factor of two.

Problem is that the argument to sum MIGHT be an iterator (not a
"normal" sequence) so sum must save the first item and concat
the _PyString_Join of the OTHER items after the first (my unit
tests were originally lax and only exercised sum with the argument
being a list -- fortunately I beefed up the unit tests as part of
preparing the patch for submission, so they caught this, as well
as the issue with sum of a sequence mixing unicode and plain
string items, which forces sum to use different concatenation code
depending on the exact type of the first item...).

Reasoning on this, and on "If the implementation is hard to explain, 
it's a bad idea", I'm starting to doubt my original intuition that "of
course" sum should be polymorphic over sequences of any type
supporting "+" -- maybe Tim Peters' concept that sum should be
restricted to sequences of numbers is sounder -- it's irksome that,
of sum's 50 lines of C, 14 should deal with the special case of
"sequence of strings" and STILL involve a factor-of-2 performance
hit wrt ''.join!  However, HOW do we catch attempts to use sum on
a sequence of strings while still allowing the use case of a sequence
of timedeltas?  [maybe a timedelta SHOULD be summable to 0
and we could take the 'summable to 0' as a test of numberhood?-)]

I don't know, so, I've submitted the patch as it stands, and I hope
somebody can suggest a better solution - I just PRAY that sum
won't accept a sequence of strings AND sum them up with + ,
thus perpetuating the "newbie performance trap" of that idiom!-)