[Python-Dev] acceptability of asm in python code?

Christian Tismer tismer@tismer.com
Sun, 09 Mar 2003 05:16:24 +0100

Tim Peters wrote:
> [Damien Morton]
>>In the BINARY_ADD opcode, and in most arithmetic opcodes,
> Aren't add and subtract the whole story here?
>>there is a line that checks for overflow that looks like this:
>>if ((i^a) < 0 && (i^b) < 0) goto slow_add;
>>I got a small speedup by replacing this with a macro defined thusly:
>>#if defined(_MSC_VER) and defined(_M_IX86)
> "and" isn't C, so I assume you were very lucky <wink>.
>>#define IF_OVERFLOW_GOTO(X) __asm { jo X };
>>#else
>>#define IF_OVERFLOW_GOTO(X) if ((i^a) < 0 && (i^b) < 0) goto X;
>>#endif
>>Would this case be an acceptable use of snippets of inline assembler?
> If you had said "a huge speedup, on all programs", on the weak end of maybe.
> "Small speedup" isn't worth the obscurity.  Note that Python contains no
> assembler now.
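
For reference, the sign-bit test quoted above can be tried
in isolation. The following is only a minimal sketch, not
the CPython source: the helper name add_would_overflow is
hypothetical, and the addition is done on unsigned values
(an assumption made here so that wraparound is well defined
in C). Overflow is reported when the sign of the sum differs
from the sign of both operands, which is the same condition
as (i^a) < 0 && (i^b) < 0.

```c
#include <limits.h>

/* Hypothetical helper (not from CPython): returns 1 if a + b
 * does not fit in a long, 0 otherwise. */
static int add_would_overflow(long a, long b)
{
    /* Add as unsigned so that wraparound is well defined. */
    unsigned long ua = (unsigned long)a;
    unsigned long ub = (unsigned long)b;
    unsigned long ui = ua + ub;

    /* Mask that selects only the sign bit of a long. */
    unsigned long sign = ~(ULONG_MAX >> 1);

    /* (ui ^ ua) has its sign bit set when the sum and a differ
     * in sign; likewise for b. Both bits set means overflow. */
    return ((ui ^ ua) & (ui ^ ub) & sign) != 0;
}
```

Operands of opposite sign can never overflow, which is why
both XOR terms have to be negative at the same time.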

Just to add my 0.02 EUR.

You know that I'm not reluctant to use assembly for
platform-specific speedups.
But first, I'm with Tim: I wouldn't go down this path
for such a small win.
Second, I'd like to point out that resorting to assembly
in a function as huge as eval_frame is rather
dangerous: every compiler has its own way of
handling the appearance of assembly. This is a risky
path, believe me:

MS C's behavior is one of the worst, which is the
reason why I was very careful to put the assembly into
a clean room for Stackless, for instance:
As soon as ASM code appears in a function, the
calling sequence and the optimization strategy
change drastically. Register allocation is changed,
the optimization level is reduced, and the function
is *never* compiled without a stack frame.
Right now this might not change eval_frame's behavior
too much, just because it is too big to benefit
from certain optimizations, but I remember that
I once changed it to use about two fewer registers,
and I might re-apply these changes to give the eval loop
a boost of about 10 percent.
The existence of a single asm statement would
void this effect!

Hint: Write a small, understandable function twice,
once using assembly and once without. Compile the
stuff, and set the listing option to everything.
Then look at the .cod file, and be amazed at how
different the two versions are.
This will make you very reluctant to use any asm statement
at all, unless you want to re-write the whole function
in assembly, including the "naked" option.

Doing the latter for eval_frame would be worthwhile,
but then I'd suggest doing this as an external .asm
file. If you do this right, taking cache lines and
probabilities into account, you can surely achieve
an overall gain of up to 20 percent.

But even this remarkable gain wouldn't be enough,
even for me, to go this hard path for a single platform.

sincerely -- chris

Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/