PyPy doesn't make code written in C faster
Hi all,

Some people learn about PyPy, and the first program they try to measure speed with is something like this:

    def factorial(n):
        res = 1
        for i in range(1, n + 1):
            res *= i
        return res

    print factorial(25000)

It may not be completely obvious a priori, but this is as bogus as it gets. By now it is only 50% slower in PyPy than in CPython, thanks to efforts from various people. The issue is of course that it's an algorithm which, in CPython or in PyPy, spends most of its time in C code computing with rather large "long" objects. (No, PyPy doesn't contain magic to speed up C code 10 times.) In fact, this program spends more than 2/3 of its time in the final repr() of the result! Converting a long to base 10 is a quadratic operation.

Does it still make sense to add programs like this to our benchmarks? So far, our benchmarks are "real-life" examples. Benchmarks like the one above completely miss the point of PyPy, as they don't stress the Python interpreter part at all. There are also other cases where PyPy's performance is very bad, like cpyext on an extension module with lots of small C API calls. I believe it would still make sense to list such cases in the official benchmarks, and have the descriptions of the benchmarks explain what's wrong with them.

A bientôt,

Armin.
On 30/05/13 11:23, Armin Rigo wrote:
... There are also other cases where PyPy's performance is very bad, like cpyext on an extension module with lots of small C API calls. I believe that it would still make sense to list such cases in the official benchmark, and have the descriptions of the benchmarks explain what's wrong with them.
On the other hand, there are also valid benchmark cases with very bad performance. Off the top of my head, reading a unicode text file was around 10-12 times slower (if I remember correctly) last time that I checked.

l.
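A minimal sketch of the kind of benchmark meant here (the file name is made up; codecs.open is just one way to read unicode text in Python 2):

    import codecs
    import time

    start = time.time()
    f = codecs.open("big_utf8_file.txt", "r", "utf-8")
    for line in f:
        pass              # just decode and iterate over the lines
    f.close()
    print "unicode read time", time.time() - start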
On Thu, May 30, 2013 at 10:23:17AM +0200, Armin Rigo wrote:
Hi all,
Some people learn about PyPy, and the first program they try to measure speed with is something like this:
    def factorial(n):
        res = 1
        for i in range(1, n + 1):
            res *= i
        return res

    print factorial(25000)
It may not be completely obvious a priori, but this is as bogus as it gets. This is by now only 50% slower in PyPy than in CPython thanks to efforts from various people. The issue is of course that it's an algo which, in CPython or in PyPy, spends most of its time in C code computing with rather large "long" objects. (No, PyPy doesn't contain magic to speed up C code 10 times.) In fact, this program spends more than 2/3rd of its time in the final repr() of the result! Converting a long to base 10 is a quadratic operation.
It doesn't have to be quadratic, it's easy to come up with a splitting algorithm:

    def reclongtostr(x):
        if x < 0:
            return "-" + reclongtostr(-x)
        x = long(x)  # expect a long
        min_digits = 9  # fits in 32 bits, there may be a better choice for this
        pts = [10**min_digits]
        while pts[-1] < x:
            pts.append(pts[-1]**2)
        pts.pop()  # remove first 10**2**i greater than x
        output = []
        def spl(x, i):
            if i < 0:
                # bottomed out with max_digit sized pieces
                if output or x > 0:
                    s = str(x)
                    output.append("0"*(min_digits - len(s)) + s)  # note that this appends in order
            else:
                top, bot = divmod(x, pts[i])  # split the number
                spl(top, i-1)
                spl(bot, i-1)
        spl(x, len(pts)-1)
        # strip leading zeros, we can probably do this more elegantly
        while output[0][0] == "0":
            output[0] = output[0][1:]
        return ''.join(output)

which benchmarks factorial(25000) like this:

    import time

    s = time.time()
    x = factorial(25000)
    print "factorial", time.time() - s

    sx = str(x)  # give pypy a chance to compile
    s = time.time()
    sx = str(x)
    print "Str time", time.time() - s

    rsx = reclongtostr(x)  # give pypy a chance to compile
    s = time.time()
    rsx = reclongtostr(x)
    print "my rec time", time.time() - s
    print "equal string:", sx == rsx

    factorial 0.182402133942
    Str time 0.505062818527
    my rec time 0.0678248405457
    equal string: True

I'm sure a better programmer than I could make this faster by avoiding saving intermediate results and with various micro-optimisations. But beating the built-in C implementation by a factor of 7.5 seems a reasonable outcome for pypy.

I think I could come up with a linear time two pass algorithm working on int digits if this were important to pypy.
Does it still make sense to add programs like this to our benchmarks? So far, our benchmarks are "real-life" examples. The benchmarks like above are completely missing the point of PyPy, as they don't stress at all the Python interpreter part. There are also other cases where PyPy's performance is very bad, like cpyext on an extension module with lots of small C API calls. I believe that it would still make sense to list such cases in the official benchmark, and have the descriptions of the benchmarks explain what's wrong with them.
I agree that you should include them, but I disagree that they are 'wrong'. They measure the overhead of a C call. Why should a C call be slower in pypy than cpython? Presumably it could be compiled down to the appropriate instructions and then out-perform cpy.

Now that the topic of benchmarks has come up, I came across this benchmark recently: http://dalkescientific.com/writings/diary/archive/2009/11/15/100000_tasklets...

The same benchmark took 8.5s on pypy 2beta2 and takes 7.5s on pypy 2.0.1. Are there any obvious reasons why pypy's tasklets are so slow to switch? (Is it the scheduler?) This is important for my adoption of pypy at work.

njh
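For context, the pattern that benchmark stresses is roughly this (a from-memory sketch, not the code behind the link): many tasklets that do nothing but hand control back to the scheduler.

    import stackless

    def worker():
        for _ in range(10):
            stackless.schedule()      # yield to the next runnable tasklet

    for _ in range(100000):
        stackless.tasklet(worker)()   # create 100,000 tasklets
    stackless.run()                   # run the round-robin scheduler

The total time is dominated by the cost of a single tasklet switch, which is what the question above is about.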
2013/5/30 Nathan Hurst <njh@njhurst.com>
Does it still make sense to add programs like this to our benchmarks? So far, our benchmarks are "real-life" examples. The benchmarks like above are completely missing the point of PyPy, as they don't stress at all the Python interpreter part. There are also other cases where PyPy's performance is very bad, like cpyext on an extension module with lots of small C API calls. I believe that it would still make sense to list such cases in the official benchmark, and have the descriptions of the benchmarks explain what's wrong with them.
I agree that you should include them, I disagree that they are 'wrong'. They measure the overhead of a C call. Why should a C call be slower in pypy than cpython? Presumably it could be compiled down to the appropriate instructions and then out-perform cpy.
The C API here is the one of the CPython interpreter (PyLong_FromLong & co). To support it, PyPy has to emulate many aspects of CPython, especially the fact that PyPy objects live in movable memory while a PyObject* pointer is not supposed to change.

To get fair benchmarks, those extension modules should be rewritten, with cffi for example: its C calls have very little overhead, and it integrates very well with the rest of the PyPy interpreter.

--
Amaury Forgeot d'Arc
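As a minimal sketch of what a cffi-based call looks like (calling sqrt from libm; not taken from any real extension module):

    import cffi

    ffi = cffi.FFI()
    ffi.cdef("double sqrt(double x);")   # declare the C function we want to call
    libm = ffi.dlopen("m")               # load the C math library

    print libm.sqrt(2.0)                 # this call needs no PyObject* emulation on PyPy

Because no CPython-level PyObject* has to be kept alive and unmoving across the call, the JIT can make such calls cheap, unlike cpyext calls.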
Hi Nathan, On Thu, May 30, 2013 at 6:41 PM, Nathan Hurst <njh@njhurst.com> wrote:
It doesn't have to be quadratic, it's easy to come up with a splitting algorithm:
I believe that you're right on one point and wrong on another. You're right in that this gives a faster algo for str(). You're wrong in that it's still quadratic: if 'a' has 2N digits and 'b' has N digits, then divmod(a, b) is itself quadratic --- it takes time proportional to N*N. You can see this by measuring the time your algo takes to do the repr of larger and larger numbers.
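Concretely, such a measurement could look like this (a sketch, assuming the reclongtostr from the previous mail is defined; if the algo is quadratic, the time roughly quadruples each round):

    import time

    digits = 10000
    x = 10 ** digits              # a number with digits+1 decimal digits
    for _ in range(5):
        start = time.time()
        reclongtostr(x)
        print digits, time.time() - start
        x = x * x                 # squaring doubles the number of decimal digits
        digits *= 2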
beating the builtin C implementation by a factor of 7.5 seems a reasonable outcome for pypy.
No, precisely my point: this argument is bogus. The proof that it's wrong is that CPython gets very similar timing results! Your pure Python version outperforms the C str(long) in a very similar way on PyPy and on CPython! The "bug" is originally in CPython, for having a str() that is too slow, and I just copied it into PyPy. The pure Python version you posted is faster. Its speed is roughly the same on CPython and on PyPy because most of the time is spent doing divmod on large "long" objects (which is this post's original point).
I think I could come up with a linear time two pass algorithm working on intdigits if this were important to pypy.
That would be interesting for both PyPy and CPython. A bientôt, Armin.
Hi Armin,

I have only glanced at the code, but isn't the right argument of the divmod always a power of two? So it could be replaced by a shift and a mask, giving the right complexity.

Cheers,

Carl Friedrich

Armin Rigo <arigo@tunes.org> wrote:
Hi Nathan,
On Thu, May 30, 2013 at 6:41 PM, Nathan Hurst <njh@njhurst.com> wrote:
It doesn't have to be quadratic, it's easy to come up with a splitting algorithm:
I believe that you're right on one point and wrong on another. You're right in that this gives a faster algo for str(). You're wrong in that it's still quadratic. If 'a' has 2N digits and 'b' has N digits, then divmod(a,b) is quadratic --- takes time proportional to N*N. It can be shown by measuring the time spent by your algo to do the repr of larger and larger numbers.
beating the builtin C implementation by a factor of 7.5 seems a reasonable outcome for pypy.
No, precisely my point: this argument is bogus. The proof that it's wrong is that CPython gets very similar timing results! Your pure Python version outperforms the C str(long) in a very similar way on PyPy and on CPython! The "bug" is originally in CPython, for having a str() that is too slow, and I just copied it into PyPy. The pure Python version you posted is faster. Its speed is roughly the same on CPython and on PyPy because most of the time is spent doing divmod on large "long" objects (which is this post's original point).
I think I could come up with a linear time two pass algorithm working on intdigits if this were important to pypy.
That would be interesting for both PyPy and CPython.
A bientôt,
Armin.
On 05/30/2013 06:41 PM Nathan Hurst wrote:
On Thu, May 30, 2013 at 10:23:17AM +0200, Armin Rigo wrote:
Hi all,
Some people learn about PyPy, and the first program they try to measure speed with is something like this:
    def factorial(n):
        res = 1
        for i in range(1, n + 1):
            res *= i
        return res

    print factorial(25000)
It may not be completely obvious a priori, but this is as bogus as it gets. This is by now only 50% slower in PyPy than in CPython thanks to efforts from various people. The issue is of course that it's an algo which, in CPython or in PyPy, spends most of its time in C code computing with rather large "long" objects. (No, PyPy doesn't contain magic to speed up C code 10 times.) In fact, this program spends more than 2/3rd of its time in the final repr() of the result! Converting a long to base 10 is a quadratic operation.
It doesn't have to be quadratic, it's easy to come up with a splitting algorithm:

    def reclongtostr(x):
        if x < 0:
            return "-" + reclongtostr(-x)
        x = long(x)  # expect a long
        min_digits = 9  # fits in 32 bits, there may be a better choice for this
        pts = [10**min_digits]
        while pts[-1] < x:
            pts.append(pts[-1]**2)
        pts.pop()  # remove first 10**2**i greater than x
        output = []
        def spl(x, i):
            if i < 0:
                # bottomed out with max_digit sized pieces
                if output or x > 0:
                    s = str(x)
                    output.append("0"*(min_digits - len(s)) + s)  # note that this appends in order
            else:
                top, bot = divmod(x, pts[i])  # split the number
                spl(top, i-1)
                spl(bot, i-1)
        spl(x, len(pts)-1)
        # strip leading zeros, we can probably do this more elegantly
        while output[0][0] == "0":
            output[0] = output[0][1:]
        return ''.join(output)
which benchmarks factorial(25000) like this:
    import time

    s = time.time()
    x = factorial(25000)
    print "factorial", time.time() - s

    sx = str(x)  # give pypy a chance to compile
    s = time.time()
    sx = str(x)
    print "Str time", time.time() - s

    rsx = reclongtostr(x)  # give pypy a chance to compile
    s = time.time()
    rsx = reclongtostr(x)
    print "my rec time", time.time() - s
    print "equal string:", sx == rsx
    factorial 0.182402133942
    Str time 0.505062818527
    my rec time 0.0678248405457
    equal string: True
I'm sure a better programmer than I could make this faster by avoiding saving intermediate results and various micro optimisations. But beating the builtin C implementation by a factor of 7.5 seems a reasonable outcome for pypy.
I think I could come up with a linear time two pass algorithm working on intdigits if this were important to pypy.
Does it still make sense to add programs like this to our benchmarks? So far, our benchmarks are "real-life" examples. The benchmarks like above are completely missing the point of PyPy, as they don't stress at all the Python interpreter part. There are also other cases where PyPy's performance is very bad, like cpyext on an extension module with lots of small C API calls. I believe that it would still make sense to list such cases in the official benchmark, and have the descriptions of the benchmarks explain what's wrong with them.
I agree that you should include them, I disagree that they are 'wrong'. They measure the overhead of a C call. Why should a C call be slower in pypy than cpython? Presumably it could be compiled down to the appropriate instructions and then out-perform cpy.
Now that the topic of benchmarks has come up, I came across this benchmark recently: http://dalkescientific.com/writings/diary/archive/2009/11/15/100000_tasklets...
The same benchmark took 8.5s on pypy 2beta2 and takes 7.5s on pypy 2.0.1. Is there any obvious reasons why pypy's tasklets are so slow to switch? (Is it the scheduler?) This is important for my adoption of pypy at work.
njh
I remember doing something similar to convert long to string, way back in .. let's see .. googling google comp.lang.python .. oy, 2001, it's been a while ;-)

https://groups.google.com/forum/?hl=sv&fromgroups#!searchin/comp.lang.python/richter$20strL.py/comp.lang.python/6HYHojX7ZlA/Wizytwby71QJ

(from top, search to first instance of strL.py)

I guess it's not too long to post a copy here. (Hm, I see I should have imported time from time instead of clock, since the latter has just scheduler tick resolution, and time is high resolution.)

____________________________________________________
(A couple of lines wrapped...)
____________________________________________________

    # strL.py -- recursive long to decimal string conversion
    # 2001-07-10 bokr
    #
    p10d = {}

    def p10(n):
        if not p10d.has_key(n):
            p = p10d[n] = 10L**n
            return p
        return p10d[n]

    def strLR(n, w=0):  # w>0 is known width, w<0 is searching guess
        if w == 0:
            return []
        if w > 0:
            if w <= 9:
                return [('%0' + chr(w + 48) + 'd') % n]
            wr = w / 2
            nl, nr = divmod(n, p10(wr))
            return strLR(nl, w - wr) + strLR(nr, wr)
        else:
            nl, nr = divmod(n, p10(-w))
            if nl:
                return strLR(nl, 2*w) + strLR(nr, -w)
            else:
                if w >= -9:
                    return ['%d' % n]
                else:
                    return strLR(nr, w / 2)

    def strL(n):
        if n < 0:
            return ''.join(['-'] + strLR(-n, -9))
        else:
            return ''.join(strLR(n, -9))

    from time import clock
    import sys

    def main():
        def pr(x):
            print x
        def psl(x):
            s = strL(x)
            sys.stdout.write(s)
            sys.stdout.write('\n')
        dt = {'str': str, 'strL': strL, 'repr': repr, 'print': pr, 'printStrL': psl}
        try:
            x = long(eval(sys.argv[2]))
            fn = sys.argv[1]
            fcn = dt[fn]
        except:
            sys.stderr.write("usage: %s [str strL repr print printStrL] <const expr>\n" % sys.argv[0])
            sys.exit(2)
        t0 = clock()
        fcn(x)
        t1 = clock()
        print "%s took %9.6f seconds" % (fn, t1 - t0)

    if __name__ == "__main__":
        main()
____________________________________________________

Got curious, so I added to your benchmark:

    from strL import strL  ## bokr mod

and

    ## bokr mod: add timing of strL, like above
    rsb = strL(x)  # give pypy a chance to compile
    s = time.time()
    rsx = strL(x)
    print "strL rec time", time.time() - s
    print "equal string:", sx == rsb

then ran it (with old python and pypy on an old laptop). Version info:

    [14:49 ~/wk/py/clp]$ python
    Python 2.7.2 (default, Jul 8 2011, 23:38:53)
    [GCC 4.1.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.

    [14:49 ~/wk/py/clp]$ pypy
    pypy: /usr/lib/libcrypto.so.0.9.8: no version information available (required by pypy)
    pypy: /usr/lib/libssl.so.0.9.8: no version information available (required by pypy)
    Python 2.7.1 (b590cf6de419, Apr 30 2011, 02:00:38)
    [PyPy 1.5.0-alpha0 with GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    And now for something completely different: ``nothing is true''
The bench run:

    [14:49 ~/wk/py/clp]$ python NathanHurst.py
    factorial 1.36516094208
    Str time 2.93479013443
    my rec time 1.45956683159
    equal string: True
    strL rec time 1.34501504898
    equal string: True

    [14:50 ~/wk/py/clp]$ pypy NathanHurst.py
    pypy: /usr/lib/libcrypto.so.0.9.8: no version information available (required by pypy)
    pypy: /usr/lib/libssl.so.0.9.8: no version information available (required by pypy)
    factorial 2.29024791718
    Str time 3.14243102074
    my rec time 1.25054502487
    equal string: True
    strL rec time 1.12671113014
    equal string: True
    [14:50 ~/wk/py/clp]$

Hm, wonder why your factorial was slower on pypy .. prob due to an old version?

Regards,
Bengt Richter
Hi Carl Friedrich, On Fri, May 31, 2013 at 12:01 PM, Carl Friedrich Bolz <cfbolz@gmx.de> wrote:
I have only glanced at the code, but isn't the right argument of the divmod always a power of two? So it can be replaced by a shift and a mask, giving the right complexity.
It's a power of 10. A bientôt, Armin.
Hi Bengt, On Fri, May 31, 2013 at 3:01 PM, Bengt Richter <bokr@oz.net> wrote:
[PyPy 1.5.0-alpha0 with GCC 4.4.3] on linux2
... Hm, wonder why your factorial was slower on pypy .. prob due to old version?
Another benchmark that completely misses the point, yay! This one shows that we improved quite a bit since PyPy 1.5.0-alpha0, which was literally ages ago. A bientôt, Armin.
On Wed, Jun 26, 2013 at 09:06:01AM +0200, Carl Friedrich Bolz wrote:
On 30.05.2013 18:41, Nathan Hurst wrote:
It doesn't have to be quadratic, it's easy to come up with a splitting algorithm:
FWIW, I started turning this code into RPython code on the faster-str-of-bigint branch.
Do you have a github gist or similar? njh
On 29.06.2013 16:19, Nathan Hurst wrote:
On Wed, Jun 26, 2013 at 09:06:01AM +0200, Carl Friedrich Bolz wrote:
On 30.05.2013 18:41, Nathan Hurst wrote:
It doesn't have to be quadratic, it's easy to come up with a splitting algorithm:
FWIW, I started turning this code into RPython code on the faster-str-of-bigint branch.
Do you have a github gist or similar?
I already merged it to the main branch:

https://bitbucket.org/pypy/pypy/commits/902241cca7dc4cb76bf65732c3d134543fb4...

Thanks very much :-). The main changes I made to your version were to generalize it to work for any base 3 <= n <= 36, optimize the handling of leading zeros, and use a StringBuilder instead of a list to build the resulting string. Plus I kept the already existing fast version for power-of-two bases alive.

Cheers,

Carl Friedrich
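For readers who haven't seen it, the StringBuilder pattern mentioned above looks roughly like this (a sketch with made-up helper names, not the actual rbigint code):

    from rpython.rlib.rstring import StringBuilder

    def join_pieces(pieces):
        # accumulate the digit chunks in a single resizable buffer
        builder = StringBuilder()
        for piece in pieces:
            builder.append(piece)
        return builder.build()   # produce the final string once, at the end

Compared to appending to a list and joining it, this avoids allocating many intermediate string objects during translation to C.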
On Sat, Jun 29, 2013 at 04:27:02PM +0200, Carl Friedrich Bolz wrote:
On 29.06.2013 16:19, Nathan Hurst wrote:
On Wed, Jun 26, 2013 at 09:06:01AM +0200, Carl Friedrich Bolz wrote:
On 30.05.2013 18:41, Nathan Hurst wrote:
It doesn't have to be quadratic, it's easy to come up with a splitting algorithm:
FWIW, I started turning this code into RPython code on the faster-str-of-bigint branch.
I don't actually see any difference between that code and normal python. What makes it rpython?
Do you have a github gist or similar?
I already merged it to the main branch:
https://bitbucket.org/pypy/pypy/commits/902241cca7dc4cb76bf65732c3d134543fb4...
Thanks very much :-).
The main changes I did from your version was generalize it to work for any 3 <= n <= 36, optimize the handling of leading zeros and use a StringBuilder instead of a list to build the resulting string. Plus I kept the already existing fast version for powers of two alive.
All good decisions. Is there a nice way to profile this with something like line_profiler? It's possible to improve the asymptotics in various places, but they almost certainly aren't worth it.

Of course the real question is whether pypy now beats python2.7 on the all-important factorial benchmark. (And if so, should we rush out a new release? ;)

njh
On 29/06/13 20:06, Nathan Hurst wrote:
I don't actually see any difference between that code and normal python. What makes it rpython?
Being RPython is mostly a contextual property: this code does nothing "too dynamic", so it is RPython.
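To give a rough feel for the distinction (my own toy example, not code from the branch):

    def fine_as_rpython(n):
        # every variable keeps a single, inferable type
        res = 1
        for i in range(1, n + 1):
            res *= i
        return res

    def too_dynamic(flag):
        # the same variable holds either an int or a string,
        # which RPython's type annotator rejects
        x = 1 if flag else "one"
        return x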
The main changes I did from your version was generalize it to work for any 3 <= n <= 36, optimize the handling of leading zeros and use a StringBuilder instead of a list to build the resulting string. Plus I kept the already existing fast version for powers of two alive.
All good decisions. Is there a nice way to profile this with something like line_profiler? It's possible to improve the asymptotics in various places, but they almost certainly aren't worth it.
For rbigint I just profiled the generated C source.
Of course the real question is whether pypy now beats python2.7 on the all important factorial benchmark. (And if so should we rush out a new release? ;)
Yes, it does, and it will be in the 2.1 release. I also made some additional changes yesterday:

https://bitbucket.org/pypy/pypy/commits/7ceb58dbdc244439b58d8bcc2d5e41a35fa3... (and following commits)

I now cache the powers of the base needed to split the number, because the profile was completely dominated by calls to pow. In addition, I compute the minimum number of digits based on the base, instead of using a pessimistic bound for all bases. Now the profile is actually dominated by calls to divmod, which is probably expected.

Cheers,

Carl Friedrich
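A pure-Python sketch of the caching idea (names are mine, not the ones used in the linked commits):

    _pow_cache = {}   # (base, exponent) -> base ** exponent

    def cached_pow(base, exponent):
        key = (base, exponent)
        if key not in _pow_cache:
            _pow_cache[key] = base ** exponent   # computed once, reused by later str() calls
        return _pow_cache[key]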
Participants (6):
- Amaury Forgeot d'Arc
- Armin Rigo
- Bengt Richter
- Carl Friedrich Bolz
- Eleytherios Stamatogiannakis
- Nathan Hurst