Terrible FPU performance

Dan Stromberg drsalists at gmail.com
Tue Apr 26 15:27:57 EDT 2011


On Tue, Apr 26, 2011 at 6:40 AM, Mihai Badoiu <mbadoiu at gmail.com> wrote:
> Hi,
> I have terrible performance for multiplication when one number gets very
> close to zero.  I'm using cython by writing the following code:
>     cdef int i
>     cdef double x = 1.0
>     for 0 <= i < 10000000:
>         x *= 0.8
>         #x += 0.01
>     print x
> This code runs much much slower (20+ times slower) with the line x += 0.01
> uncommented.  I looked at the deassembled code and it looks correct.
>  Moreover, it's just a few lines and by writing a C code (without python on
> top), I get the same code, but it's much faster.  I've also tried using sse,
> but I get exactly the same behavior.  The best candidate that I see so far
> is that Python sets up the FPU in a different state than C.
> Any advice on how to solve this performance problem?
> thanks!

I'm getting almost the opposite result: with 1 operation (just the
*=), it's about 6 times slower than with two operations (with the *=
and +=).

I investigated whether it was a GC issue; I conclude it's probably not
- the number of objects in the heap was consistent between the fast
and slow cython versions.

I also split out the addition and multiplication into their own
functions, foo1 and foo2, and profiled what was left of the original
function, foo.  This did not tell anything salient either, other than
that the foo function had some hidden performance penalty.  A line by
line profiler might help, if Cython supports going down to that level;
it seems that function by function profiling may not be granular
enough for this problem.

I found that the performance difference wasn't as marked as on your
system - more like 6x.

I also found that the duration of the faster of the two cython's was
almost exactly the same as the duration of the C program I created to
do the same thing.

BTW, what version of CPython, and what version of Cython are you
using?  I used mostly CPython 3.1, with a Cython compiled from an
early version that supported generators - which I believe has now been
merged.  You might try a few different versions of Cython and compare
the results.

What OS are you on?  What kind of CPU?  I'm on Ubuntu 10.10 with an
AMD Athlon(tm) II X2 245 Processor.

It might reveal something if you hand-coded a C extension module that
does the same thing.  And while this may be slow, it might also be
revealing to try using decimal.Decimal arithmetic.

Interesting that you get a semi-similar result in pure CPython.

Please find below a shar of my experiments.

#!/bin/sh
# This is a shell archive (produced by GNU sharutils 4.9).
# To extract the files from this archive, save it to some FILE, remove
# everything before the `#!/bin/sh' line above, then type `sh FILE'.
#
lock_dir=_sh26876
# Made on 2011-04-26 12:08 PDT by <dstromberg at benchbox>.
# Source directory was `/home/dstromberg/src/cython-slowdown'.
#
# Existing files will *not* be overwritten, unless `-c' is specified.
#
# This shar contains:
# length mode       name
# ------ ---------- ------------------------------------------
#    463 -rw-r--r-- cs.m4
#    324 -rwxr-xr-x main
#    787 -rw-r--r-- Makefile
#   1954 -rw-r--r-- output
#    180 -rw-r--r-- t.c
#
MD5SUM=${MD5SUM-md5sum}
f=`${MD5SUM} --version | egrep '^md5sum .*(core|text)utils'`
test -n "${f}" && md5check=true || md5check=false
${md5check} || \
  echo 'Note: not verifying md5sums.  Consider installing GNU coreutils.'
save_IFS="${IFS}"
IFS="${IFS}:"
gettext_dir=FAILED
locale_dir=FAILED
first_param="$1"
for dir in $PATH
do
  if test "$gettext_dir" = FAILED && test -f $dir/gettext \
     && ($dir/gettext --version >/dev/null 2>&1)
  then
    case `$dir/gettext --version 2>&1 | sed 1q` in
      *GNU*) gettext_dir=$dir ;;
    esac
  fi
  if test "$locale_dir" = FAILED && test -f $dir/shar \
     && ($dir/shar --print-text-domain-dir >/dev/null 2>&1)
  then
    locale_dir=`$dir/shar --print-text-domain-dir`
  fi
done
IFS="$save_IFS"
if test "$locale_dir" = FAILED || test "$gettext_dir" = FAILED
then
  echo=echo
else
  TEXTDOMAINDIR=$locale_dir
  export TEXTDOMAINDIR
  TEXTDOMAIN=sharutils
  export TEXTDOMAIN
  echo="$gettext_dir/gettext -s"
fi
if (echo "testing\c"; echo 1,2,3) | grep c >/dev/null
then if (echo -n test; echo 1,2,3) | grep n >/dev/null
     then shar_n= shar_c='
'
     else shar_n=-n shar_c= ; fi
else shar_n= shar_c='\c' ; fi
f=shar-touch.$$
st1=200112312359.59
st2=123123592001.59
st2tr=123123592001.5 # old SysV 14-char limit
st3=1231235901

if touch -am -t ${st1} ${f} >/dev/null 2>&1 && \
   test ! -f ${st1} && test -f ${f}; then
  shar_touch='touch -am -t $1$2$3$4$5$6.$7 "$8"'

elif touch -am ${st2} ${f} >/dev/null 2>&1 && \
   test ! -f ${st2} && test ! -f ${st2tr} && test -f ${f}; then
  shar_touch='touch -am $3$4$5$6$1$2.$7 "$8"'

elif touch -am ${st3} ${f} >/dev/null 2>&1 && \
   test ! -f ${st3} && test -f ${f}; then
  shar_touch='touch -am $3$4$5$6$2 "$8"'

else
  shar_touch=:
  echo
  ${echo} 'WARNING: not restoring timestamps.  Consider getting and
installing GNU `touch'\'', distributed in GNU coreutils...'
  echo
fi
rm -f ${st1} ${st2} ${st2tr} ${st3} ${f}
#
if test ! -d ${lock_dir} ; then :
else ${echo} "lock directory ${lock_dir} exists"
     exit 1
fi
if mkdir ${lock_dir}
then ${echo} "x - created lock directory ${lock_dir}."
else ${echo} "x - failed to create lock directory ${lock_dir}."
     exit 1
fi
# ============= cs.m4 ==============
if test -f 'cs.m4' && test "$first_param" != -c; then
${echo} "x - SKIPPING cs.m4 (file already exists)"
else
${echo} "x - extracting cs.m4 (text)"
  sed 's/^X//' << 'SHAR_EOF' > 'cs.m4' &&
X
# cython: profile=True
X
import gc
X
cpdef foo():
X	cdef int i
X	cdef double x = 1.0
X	#for 0 <= i < 10000000:
X	for 0 <= i < 100000000:
#		if i % 10000000 == 0:
#			print(`len'(gc.get_objects()))
X		x *= 0.8
X		# This code runs much much slower (20+ times slower) with the line
X		# x += 0.01 uncommented.  IOW, with 1 operation, it's 20 times faster;
X		# with 2 operations, it's slower than it should be.
X		ifdef(`commented',`# ')x += 0.01
X		#x += 0.0
X	#print x
X
X
SHAR_EOF
  (set 20 11 04 26 12 07 22 'cs.m4'
   eval "${shar_touch}") && \
  chmod 0644 'cs.m4'
if test $? -ne 0
then ${echo} "restore of cs.m4 failed"
fi
  if ${md5check}
  then (
       ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'cs.m4': 'MD5 check failed'
       ) << \SHAR_EOF
56c23edc97b77a178c2fdf50399041c2  cs.m4
SHAR_EOF
  else
test `LC_ALL=C wc -c < 'cs.m4'` -ne 463 && \
  ${echo} "restoration warning:  size of 'cs.m4' is not 463"
  fi
fi
# ============= main ==============
if test -f 'main' && test "$first_param" != -c; then
${echo} "x - SKIPPING main (file already exists)"
else
${echo} "x - extracting main (text)"
  sed 's/^X//' << 'SHAR_EOF' > 'main' &&
#!/usr/bin/python
X
import sys
import pstats
import cProfile
X
if sys.argv[1:] == [ 'cs1' ]:
X	import cs1 as cs
elif sys.argv[1:] == [ 'cs2' ]:
X	import cs2 as cs
else:
X	raise ValueError
X
cProfile.runctx("cs.foo()", globals(), locals(), "foo.prof")
X
s = pstats.Stats("foo.prof")
s.strip_dirs().sort_stats("time").print_stats()
X
SHAR_EOF
  (set 20 11 04 26 11 22 14 'main'
   eval "${shar_touch}") && \
  chmod 0755 'main'
if test $? -ne 0
then ${echo} "restore of main failed"
fi
  if ${md5check}
  then (
       ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'main': 'MD5 check failed'
       ) << \SHAR_EOF
0c18c9a91e04269b39fb115f37b375e2  main
SHAR_EOF
  else
test `LC_ALL=C wc -c < 'main'` -ne 324 && \
  ${echo} "restoration warning:  size of 'main' is not 324"
  fi
fi
# ============= Makefile ==============
if test -f 'Makefile' && test "$first_param" != -c; then
${echo} "x - SKIPPING Makefile (file already exists)"
else
${echo} "x - extracting Makefile (text)"
  sed 's/^X//' << 'SHAR_EOF' > 'Makefile' &&
X
pyver=python-3.1
short_pyver=$(shell echo "$(pyver)" | sed 's/-//')
python_dir=/usr/local/c$(pyver)
X
go: cs1.so cs2.so t
X	time $(python_dir)/bin/python ./main cs1
X	time $(python_dir)/bin/python ./main cs2
X	time ./t
X
t: t.c
X	gcc -o t t.c
X
cs1.pyx cs2.pyx: cs.m4
X	m4 -Dcommented=1 < cs.m4 > cs1.pyx
X	m4 cs.m4 > cs2.pyx
X	$(python_dir)/bin/cython -a cs1.pyx
X	$(python_dir)/bin/cython -a cs2.pyx
X
cs1.so: cs1.pyx
X	$(python_dir)/bin/cython cs1.pyx
X	gcc -c cs1.c -I$(python_dir)/include/$(short_pyver)
X	gcc -o cs1.so -shared cs1.o -L$(python_dir)/lib -l$(short_pyver)
X
cs2.so: cs2.pyx
X	$(python_dir)/bin/cython cs2.pyx
X	gcc -c cs2.c -I$(python_dir)/include/$(short_pyver)
X	gcc -o cs2.so -shared cs2.o -L$(python_dir)/lib -l$(short_pyver)
X
clean:
X	rm -f cs[12].c *.o *.so *.pyc t *.pyx *.prof
X
SHAR_EOF
  (set 20 11 04 26 12 04 13 'Makefile'
   eval "${shar_touch}") && \
  chmod 0644 'Makefile'
if test $? -ne 0
then ${echo} "restore of Makefile failed"
fi
  if ${md5check}
  then (
       ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'Makefile': 'MD5 check failed'
       ) << \SHAR_EOF
dbfc58948b66668e1c3a3efc90d84dac  Makefile
SHAR_EOF
  else
test `LC_ALL=C wc -c < 'Makefile'` -ne 787 && \
  ${echo} "restoration warning:  size of 'Makefile' is not 787"
  fi
fi
# ============= output ==============
if test -f 'output' && test "$first_param" != -c; then
${echo} "x - SKIPPING output (file already exists)"
else
${echo} "x - extracting output (text)"
  sed 's/^X//' << 'SHAR_EOF' > 'output' &&
/usr/local/cpython-3.1/bin/cython cs1.pyx
gcc -c cs1.c -I/usr/local/cpython-3.1/include/python3.1
gcc -o cs1.so -shared cs1.o -L/usr/local/cpython-3.1/lib -lpython3.1
/usr/local/cpython-3.1/bin/cython cs2.pyx
gcc -c cs2.c -I/usr/local/cpython-3.1/include/python3.1
gcc -o cs2.so -shared cs2.o -L/usr/local/cpython-3.1/lib -lpython3.1
time /usr/local/cpython-3.1/bin/python ./main cs1
Tue Apr 26 12:01:15 2011    foo.prof
X
X         5 function calls in 6.495 CPU seconds
X
X   Ordered by: internal time
X
X   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
X        1    6.495    6.495    6.495    6.495 cs1.pyx:6(foo)
X        1    0.000    0.000    6.495    6.495 {built-in method exec}
X        1    0.000    0.000    6.495    6.495 <string>:1(<module>)
X        1    0.000    0.000    6.495    6.495 {built-in method foo}
X        1    0.000    0.000    0.000    0.000 {method 'disable' of
'_lsprof.Profiler' objects}
X
X
5.92user 0.02system 0:06.51elapsed 91%CPU (0avgtext+0avgdata 23568maxresident)k
0inputs+8outputs (0major+1608minor)pagefaults 0swaps
time /usr/local/cpython-3.1/bin/python ./main cs2
0.05
Tue Apr 26 12:01:16 2011    foo.prof
X
X         5 function calls in 0.987 CPU seconds
X
X   Ordered by: internal time
X
X   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
X        1    0.987    0.987    0.987    0.987 cs2.pyx:6(foo)
X        1    0.000    0.000    0.987    0.987 {built-in method exec}
X        1    0.000    0.000    0.987    0.987 <string>:1(<module>)
X        1    0.000    0.000    0.987    0.987 {built-in method foo}
X        1    0.000    0.000    0.000    0.000 {method 'disable' of
'_lsprof.Profiler' objects}
X
X
1.00user 0.00system 0:01.01elapsed 99%CPU (0avgtext+0avgdata 23584maxresident)k
0inputs+8outputs (0major+1609minor)pagefaults 0swaps
time ./t
0.050000
0.87user 0.00system 0:00.87elapsed 99%CPU (0avgtext+0avgdata 1616maxresident)k
0inputs+0outputs (0major+140minor)pagefaults 0swaps
SHAR_EOF
  (set 20 11 04 26 12 01 17 'output'
   eval "${shar_touch}") && \
  chmod 0644 'output'
if test $? -ne 0
then ${echo} "restore of output failed"
fi
  if ${md5check}
  then (
       ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'output': 'MD5 check failed'
       ) << \SHAR_EOF
c024cb5c585e47771f068d58396f5f17  output
SHAR_EOF
  else
test `LC_ALL=C wc -c < 'output'` -ne 1954 && \
  ${echo} "restoration warning:  size of 'output' is not 1954"
  fi
fi
# ============= t.c ==============
if test -f 't.c' && test "$first_param" != -c; then
${echo} "x - SKIPPING t.c (file already exists)"
else
${echo} "x - extracting t.c (text)"
  sed 's/^X//' << 'SHAR_EOF' > 't.c' &&
X
#include <stdio.h>
#include <stdlib.h>
X
void main()
X	{
X	int i;
X	double x = 1.0;
X	for (i=0; i < 100000000; i++)
X		{
X		x *= 0.8;
X		x += 0.01;
X		}
X	printf("%f\n", x);
X	exit(0);
X	}
X
SHAR_EOF
  (set 20 11 04 26 11 46 49 't.c'
   eval "${shar_touch}") && \
  chmod 0644 't.c'
if test $? -ne 0
then ${echo} "restore of t.c failed"
fi
  if ${md5check}
  then (
       ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 't.c': 'MD5 check failed'
       ) << \SHAR_EOF
ebeb48af881c2309d26b0da6a8b9256f  t.c
SHAR_EOF
  else
test `LC_ALL=C wc -c < 't.c'` -ne 180 && \
  ${echo} "restoration warning:  size of 't.c' is not 180"
  fi
fi
if rm -fr ${lock_dir}
then ${echo} "x - removed lock directory ${lock_dir}."
else ${echo} "x - failed to remove lock directory ${lock_dir}."
     exit 1
fi
exit 0



More information about the Python-list mailing list