Real-world Python code 700 times slower than C

Tim Hochberg tim.hochberg at ieee.org
Sat Jan 5 12:04:19 EST 2002


By cheating a bit I came up with a version using Numeric that runs more than
50x faster than the original Python version. I didn't compare it with the C
version directly, but that should put things back in the 10-15x-slower
range. The cheat is that I use an array of floats rather than the doubles
the C code uses, but floats seem sufficient for a color picker. Not sure if
this is useful to anyone, but it was sort of entertaining.

-tim

def Ramp1(result, size, start, end):
    step = (end-start)/(size-1)
    for i in xrange(size):
        result[i] = start + step*i

import Numeric as np
_counter = np.arange(100).astype('f')

def Ramp2(result, size, start, end):
    global _counter
    try:
        # fast path: copy from the cached 0..n-1 counter array
        result[:] = _counter[:size]
    except ValueError:
        # cached counter is too small; regrow it to match result
        _counter = np.arange(len(result)).astype('f')
        result[:] = _counter[:size]
    # scale and shift in place -- no Python-level loop
    result *= (end-start)/(size-1)
    result += start


def main():
    import time
    array = np.array([0.0]*10000, 'f', savespace=1)  # single-precision Numeric array
    size, start, end = 10000, 0.0, 1.0
    r = range(1000)
    t0 = time.clock()
    for i in r:
        pass
    t1 = time.clock()
    for i in r:
        Ramp1(array, size, start, end)
    t2 = time.clock()
    for i in r:
        Ramp2(array, size, start, end)
    t3 = time.clock()
    # subtract the empty-loop overhead measured above from each timing
    print (t2 - t1) - (t1 - t0),
    print (t3 - t2) - (t1 - t0),

main()
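For comparison only: with NumPy, Numeric's modern successor, the whole
ramp collapses to a single vectorized call. This is a present-day sketch,
not something that existed when this was posted:

```python
import numpy as np

def ramp_np(result, size, start, end):
    # fill result in place with no Python-level loop, like Ramp2 above
    result[:] = np.linspace(start, end, size, dtype=result.dtype)

out = np.empty(10000, dtype=np.float32)  # floats, matching the post's cheat
ramp_np(out, 10000, 0.0, 1.0)
```

np.linspace includes both endpoints, so out runs from exactly 0.0 to
(approximately, in float32) 1.0.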


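The harness above subtracts an empty-loop baseline by hand; the standard
library's timeit module automates this kind of measurement. A minimal
sketch on plain Python lists (modern spelling; the numbers it prints are
illustrative, not from the original post):

```python
import timeit

def ramp1(result, size, start, end):
    # pure-Python inner loop, same shape as the original Ramp1
    step = (end - start) / (size - 1)
    for i in range(size):
        result[i] = start + step * i

result = [0.0] * 10000
# time 100 calls; timeit picks the best available clock for you
elapsed = timeit.timeit(lambda: ramp1(result, 10000, 0.0, 1.0), number=100)
print("100 calls of ramp1: %.3f s" % elapsed)
```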

"Brent Burley" <brent.burley at disney.com> wrote in message
news:e2942cb3.0201041647.20271a23 at posting.google.com...
> I often use a "10x" rule of thumb for comparing Python to C, but I
> recently hit one real-world case where Python is almost 700 times
> slower than C!  We just rewrote the routine in C and moved on, but
> this has interesting implications for Python optimization efforts.
>
> python
> ------
> def Ramp(result, size, start, end):
>     step = (end-start)/(size-1)
>     for i in xrange(size):
>         result[i] = start + step*i
>
> def main():
>     array = [0]*10000
>     for i in xrange(100):
>         Ramp(array, 10000, 0.0, 1.0)
>
> main()
>
>
> c version
> ---------
> void Ramp(double* result, int size, double start, double end)
> {
>     double step = (end-start)/(size-1);
>     int i;
>     for (i = 0; i < size; i++)
>         *result++ = start + step*i;
> }
>
> int main(void)
> {
>     double array[10000];
>     int i;
>     for (i = 0; i < 100000; i++)
>         Ramp(array, 10000, 0.0, 1.0);
>     return 0;
> }
>
> We use a Ramp function similar to this to generate rgb swatches for a
> color picker.  There are many, possibly large color swatches that are
> updated on every mouse event and performance was unacceptable in the
> pure Python version.  There are also 2d and circular swatches that
> would be even worse if coded in Python.
>
> The Python version runs in 7.7 seconds.  The C version runs in 11.3
> seconds, but loops 1000 times as much.  The ratio is therefore
> 7.7*1000/11.3 = 681 (or 68100%).
>
> As expected, 99.9% of the time is spent in eval_code2, the main
> interpreter loop.  Within the loop, the profile is:
>
> --- General loop overhead ---
> switch(opcode)    .66 sec
> HAS_ARG           .48 sec
> tstate->ticker    .42 sec
> NEXTARG           .30 sec
> NEXTOP            .15 sec
>
> --- Individual opcodes ---
> FOR_LOOP:        1.59 sec
> BINARY_MULTIPLY: 1.02 sec
> LOAD_FAST:        .99 sec
> BINARY_ADD:       .96 sec
> STORE_SUBSCR:     .78 sec
> STORE_FAST:       .15 sec
> SET_LINENO:       .09 sec
> JUMP_ABSOLUTE:    .06 sec
>
> For comparison, consider that the entire equivalent C program runs in
> .01 sec (when you equalize the number of iterations).  That means that
> just running the switch(opcode) statement takes 66 times as long as
> all the C code.
>
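For readers who want to see those opcodes for themselves: the standard
library's dis module disassembles a function's bytecode. A sketch (opcode
names have shifted across CPython versions -- e.g. FOR_LOOP became
FOR_ITER -- so the listing will not match the 2002 profile exactly):

```python
import dis, io

def ramp(result, size, start, end):
    step = (end - start) / (size - 1)
    for i in range(size):
        result[i] = start + step * i

# capture the disassembly as text; each line is one bytecode instruction
listing = io.StringIO()
dis.dis(ramp, file=listing)
print(listing.getvalue())
```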
> All the proposals I've seen for Python optimization are aimed at
> general speedups.  That's fine, but a 50% (or even 90%) speedup won't
> help much when your code is 500 times slower than it needs to be.  I
> don't think that even JIT native code compilation will help much in
> this case because of Python's dynamic nature.
>
> I like the approach that the Perl Inline module takes where you can
> put C code directly inline with your Perl code and the Inline module
> compiles and caches the C code automatically.  However the fact that
> it's C (with all of its safety and portability problems) and the fact
> that it relies on a C compiler to be properly installed and accessible
> make this approach unappealing for general use.
>
> What I really want is something spiritually equivalent to a portable
> inline assembly language with python-ish syntax that generates really
> fast native code and seamlessly integrates with python.  I can dream
> can't I?
>
> --
>
> As an aside, there's another interesting bottleneck we hit in our
> production code.  We're reading a lookup table from a text file (for
> doing image display color correction) that consists of 64K lines with
> 3 integers on each line.  The python code looks something like:
>
> rArray = []
> gArray = []
> bArray = []
> for line in open(lutPath).xreadlines():
>     entry = line.split()
>     rArray.append(int(entry[0]))
>     gArray.append(int(entry[1]))
>     bArray.append(int(entry[2]))
>
> There are all kinds of ways to optimize this a little bit, but there
> doesn't seem to be a way to make it acceptably fast.
> map(int,open(path).read().split()) gets you pretty close, but
> deinterleaving is still slow.  The C version ended up being several
> hundred times faster.
>
> Brent Burley
