Python speed vs csharp
Bengt Richter
bokr at oz.net
Fri Aug 1 05:11:35 CEST 2003
On Wed, 30 Jul 2003 23:09:22 0700, Mike <mike at nospam.com> wrote:
>Bear with me: this post is moderately long, but I hope it is relatively
>succinct.
>
>I've been using Python for several years as a behavioral modeling tool for
>the circuits I design. So far, it's been a good tradeoff: compiled C++
>would run faster, but the development time of Python is so much faster, and
>the resulting code is so much more reliable after the first pass, that I've
>never been tempted to return to C++. Every time I think stupid thoughts
>like, "I'll bet I could do this in C++," I get out my copy of Scott Meyers'
>"Effecive C++," and I'm quickly reminded why it's better to stick with
>Python (Meyers is a very good author, but points out lots of quirks and
>pitfalls with C++ that I keep thinking that I shouldn't have to worry
>about, much less try to remember). Even though Python is wonderful in that
>regard, there are problems.
>
>Here's the chunk of code that I'm spending most of my time executing:
>
># Rational approximation for erfc(x) (Abramowitz & Stegun, Sec. 7.1.26)
># Fifth order approximation. error <= 1.5e7 for all x
>#
>def erfc( x ):
> p = 0.3275911
> a1 = 0.254829592
> a2 = 0.284496736
> a3 = 1.421413741
> a4 = 1.453152027
> a5 = 1.061405429
>
> t = 1.0 / (1.0 + p*float(x))
> erfcx = ( (a1 + (a2 + (a3 +
> (a4 + a5*t)*t)*t)*t)*t ) * math.exp((x**2))
> return erfcx
>
>This is an error function approximation, which gets called around 1.5
>billion times during the simulation, and takes around 3500 seconds (just
>under an hour) to complete. While trying to speed things up, I created a
>simple test case with the code above and a main function to call it 10
>million times. The code takes roughly 210 seconds to run.
I recoded the above and also inlined the code, and also made a C DLL module version
and timed them roughly:
====< erfcopt.py >=====================================================
import math
def erfc_old( x ):
p = 0.3275911
a1 = 0.254829592
a2 = 0.284496736
a3 = 1.421413741
a4 = 1.453152027
a5 = 1.061405429
t = 1.0 / (1.0 + p*float(x))
erfcx = ( (a1 + (a2 + (a3 +
(a4 + a5*t)*t)*t)*t)*t ) * math.exp((x**2))
return erfcx
# You know that all those "constant" assignments are
# actually executed every time erfc is called,
# right? To move that to deftime instead of
# calltime, you can write
def erfc( x, # constants bound at def time, rparen moved down, commas added ):
p = 0.3275911,
a1 = 0.254829592,
a2 = 0.284496736,
a3 = 1.421413741,
a4 = 1.453152027,
a5 = 1.061405429,
exp = math.exp
): # just to make it stand out
t = 1.0 / (1.0 + p*x) #XXX# don't need float(), since p is float
return ( (a1 + (a2 + (a3 +
(a4 + a5*t)*t)*t)*t)*t ) * exp(x**2) #XXX# elim attr lookup on global math.exp
# elim intermediate store of erfcx #XXX# return erfcx
def test():
from time import clock
from erfc import erfc as erfc_in_c
# sanity check
for x in range(20): assert erfc_old(x) == erfc(x) and abs(erfc_in_c(x)erfc_old(x))<1e18
t1=clock()
for i in xrange(100000): erfc_old(1.2345)
t2=clock()
for i in xrange(100000): erfc(1.2345)
t3=clock()
# inlining code from old version:
# constant assignments hoisted out of loop, of course
p = 0.3275911
a1 = 0.254829592
a2 = 0.284496736
a3 = 1.421413741
a4 = 1.453152027
a5 = 1.061405429
x = 1.2345
for i in xrange(100000):
t = 1.0 / (1.0 + p*float(x))
erfcx = ( (a1 + (a2 + (a3 +
(a4 + a5*t)*t)*t)*t)*t ) * math.exp((x**2))
t4 = clock()
for i in xrange(100000): erfc_in_c(1.2345)
t5=clock()
print 'old: %f, new: %f, inline: %f, in_c %f' %(t2t1, t3t2, t4t3, t5t4)
import dis
print ' erfc_old code '
dis.dis(erfc_old)
print ' erfc '
dis.dis(erfc)
if __name__== '__main__':
test()
=======================================================================
Result is:
[19:48] C:\pywk\cstuff>erfcopt.py
old: 4.490975, new: 3.123005, inline: 3.051674, in_c 0.806907
I was surprised that the inline gain was not more, but I guess there was enough computation
to obscure that cost. Anyway, moving some of the work to deftime paid off about 30%. But
going to C cut 82%.
I added the C version to the sanity check, and had to allow an epsilon. I suspect that this
is because the C version probably keeps the whole problem in the FPU with 64 fractional bits
of precision until returning the final double (which has 53 bits), wherease the Python
algorithm presumably stores all intermediate values in memory with the 53bit precision,
and so loses in the overall. That's my theory, anyway.
To see the extra work the old code is doing vs the new, you can see below:
 erfc_old code 
0 SET_LINENO 2
3 SET_LINENO 3
6 LOAD_CONST 1 (0.32759110000000002)
9 STORE_FAST 3 (p)
12 SET_LINENO 4
15 LOAD_CONST 2 (0.25482959199999999)
18 STORE_FAST 2 (a1)
21 SET_LINENO 5
24 LOAD_CONST 3 (0.28449673599999997)
27 STORE_FAST 5 (a2)
30 SET_LINENO 6
33 LOAD_CONST 4 (1.4214137410000001)
36 STORE_FAST 4 (a3)
39 SET_LINENO 7
42 LOAD_CONST 5 (1.453152027)
45 STORE_FAST 7 (a4)
48 SET_LINENO 8
51 LOAD_CONST 6 (1.0614054289999999)
54 STORE_FAST 6 (a5)
Everything above here was unnecessary to do at runtime
57 SET_LINENO 10
60 LOAD_CONST 7 (1.0)
63 LOAD_CONST 7 (1.0)
66 LOAD_FAST 3 (p)
69 LOAD_GLOBAL 6 (float)
72 LOAD_FAST 0 (x)
75 CALL_FUNCTION 1
78 BINARY_MULTIPLY
79 BINARY_ADD
80 BINARY_DIVIDE
81 STORE_FAST 8 (t)
84 SET_LINENO 11
87 LOAD_FAST 2 (a1)
90 LOAD_FAST 5 (a2)
93 LOAD_FAST 4 (a3)
96 LOAD_FAST 7 (a4)
99 LOAD_FAST 6 (a5)
102 LOAD_FAST 8 (t)
105 BINARY_MULTIPLY
106 BINARY_ADD
107 LOAD_FAST 8 (t)
110 BINARY_MULTIPLY
111 BINARY_ADD
112 LOAD_FAST 8 (t)
115 BINARY_MULTIPLY
116 BINARY_ADD
117 LOAD_FAST 8 (t)
120 BINARY_MULTIPLY
121 BINARY_ADD
122 LOAD_FAST 8 (t)
125 BINARY_MULTIPLY
126 LOAD_GLOBAL 9 (math) <<+ replaced by a single LOAD_FAST
129 LOAD_ATTR 10 (exp) <<+
132 LOAD_FAST 0 (x)
135 LOAD_CONST 8 (2)
138 BINARY_POWER
139 UNARY_NEGATIVE
140 CALL_FUNCTION 1
143 BINARY_MULTIPLY
144 STORE_FAST 1 (erfcx) <<+

147 SET_LINENO 13 <<+
150 LOAD_FAST 1 (erfcx) <<+ all eliminated
153 RETURN_VALUE
154 LOAD_CONST 0 (None)
157 RETURN_VALUE
 erfc 
0 SET_LINENO 20
3 SET_LINENO 29
6 LOAD_CONST 1 (1.0)
9 LOAD_CONST 1 (1.0)
12 LOAD_FAST 1 (p)
15 LOAD_FAST 0 (x)
18 BINARY_MULTIPLY
19 BINARY_ADD
20 BINARY_DIVIDE
21 STORE_FAST 8 (t)
24 SET_LINENO 30
27 LOAD_FAST 2 (a1)
30 LOAD_FAST 3 (a2)
33 LOAD_FAST 4 (a3)
36 LOAD_FAST 5 (a4)
39 LOAD_FAST 6 (a5)
42 LOAD_FAST 8 (t)
45 BINARY_MULTIPLY
46 BINARY_ADD
47 LOAD_FAST 8 (t)
50 BINARY_MULTIPLY
51 BINARY_ADD
52 LOAD_FAST 8 (t)
55 BINARY_MULTIPLY
56 BINARY_ADD
57 LOAD_FAST 8 (t)
60 BINARY_MULTIPLY
61 BINARY_ADD
62 LOAD_FAST 8 (t)
65 BINARY_MULTIPLY
66 LOAD_FAST 7 (exp) <<
69 LOAD_FAST 0 (x)
72 LOAD_CONST 2 (2)
75 BINARY_POWER
76 UNARY_NEGATIVE
77 CALL_FUNCTION 1
80 BINARY_MULTIPLY
81 RETURN_VALUE <<
82 LOAD_CONST 0 (None)
85 RETURN_VALUE
[19:49] C:\pywk\cstuff>
Regards,
Bengt Richter
More information about the Pythonlist
mailing list