# Speed-up for loops

Alain Ketterlin alain at dpt-info.u-strasbg.fr
Wed Sep 8 15:58:01 CEST 2010

```Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:

>> With Python 3 and def f(x): return x+1, unrolling this loop 4x improved
>> speed by 15%; 4.00 minutes reduces to 3.30 minutes.

> I'm afraid that I can't replicate those figures. In my test, unrolling
> the loop causes a massive SLOWDOWN of 37%, not a speed up. Here is my
> test code:
>
> def f(x):
>     return x+1
>
> def test_loop(n):
>     a = 0
>     for i in range(n):
>         a += f(i)
>     return a
>
> def test_unrolled(n):
>     a = 0
>     for i in range(n//4):
>         a += f(4*i)
>         a += f(4*i+1)
>         a += f(4*i+2)
>         a += f(4*i+3)
>     return a
>
> from timeit import Timer
> n = 10000000
> assert test_loop(n) == test_unrolled(n)
> t1 = Timer('test_loop(n)',
>   'from __main__ import test_loop; n=%d' % n)
> t2 = Timer('test_unrolled(n)',
>   'from __main__ import test_unrolled; n=%d' % n)
>
> And here are the results using Python 3.1. Times are the best of 5 for 10
> calls each to the loop_* functions, results given in seconds:
>
>>>> min(t1.repeat(number=10, repeat=5))/10
> 5.97409288883209
>>>> min(t2.repeat(number=10, repeat=5))/10
> 8.25656900405883

Running this test with python 2.6 (on my laptop) led to:

>>> min(t1.repeat(number=10, repeat=5))/10
2.10715539455
>>> min(t2.repeat(number=10, repeat=5))/10
2.43037149906

That's a 15% slowdown. Which is reasonable since you add four multiplies
in the loop body. Changing your unrolled loop to:

def test_unrolled(n):
a = 0
for i in range(n//4):
b = 4*i
a += f(b)
a += f(b+1)
a += f(b+2)
a += f(b+3)
return a

makes both versions run in approximately the same time (2.135 vs.
2.136).

> That's slightly better (32% slower instead of 37% slower), but still a
> massive performance hit. Given these results, I'm prepared to say that
> loop unrolling in Python is almost certainly going to be a pessimation,
> not an optimization, no matter what you have inside the loop.

I don't really see why it should be the case. Do you have any idea?

I don't think either that it should speed things up significantly. Loop
unrolling in binary code is relevant mostly because it allows better
instruction scheduling (i.e., scheduling has fewer constraints in
longer instruction sequences). Python programs are way too far from
binary code for scheduling opts to apply.

-- Alain.

```