
In the BINARY_ADD opcode, and in most arithmetic opcodes, there is a line that checks for overflow that looks like this:

    if ((i^a) < 0 && (i^b) < 0) goto slow_add;

I got a small speedup by replacing this with a macro defined thusly:

    #if defined(_MSC_VER) and defined(_M_IX86)
    #define IF_OVERFLOW_GOTO(X) __asm { jo X };
    #else
    #define IF_OVERFLOW_GOTO(X) if ((i^a) < 0 && (i^b) < 0) goto X;
    #endif

Would this case be an acceptable use of snippets of inline assembler?
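For readers unfamiliar with the check quoted above, here is a minimal sketch of what it tests, assuming i holds the wrapped sum of a and b. The helper name add_overflows is hypothetical and is not taken from ceval.c. The sum is computed in unsigned arithmetic because signed overflow is undefined behaviour in C; converting the result back to long is assumed to wrap, as it does on the usual two's-complement platforms.

```c
/* Hypothetical helper, for illustration only: returns 1 if a + b does not
 * fit in a long, else 0. The wrapped sum is stored in *sum either way. */
int add_overflows(long a, long b, long *sum)
{
    /* Unsigned addition wraps by definition; the cast back to long is
     * assumed to wrap as well (two's-complement platforms). */
    long i = (long)((unsigned long)a + (unsigned long)b);
    *sum = i;

    /* (i^a) < 0 means i and a differ in sign; likewise for b.
     * The sum can only differ in sign from BOTH operands if it overflowed. */
    return (i ^ a) < 0 && (i ^ b) < 0;
}
```

Operands of opposite sign can never trip the test, because the sum always shares its sign with one of them.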

[Damien Morton]
In the BINARY_ADD opcode, and in most arithmetic opcodes,
Aren't add and subtract the whole story here?
there is a line that checks for overflow that looks like this:
if ((i^a) < 0 && (i^b) < 0) goto slow_add;
I got a small speedup by replacing this with a macro defined thusly:
#if defined(_MSC_VER) and defined(_M_IX86)
"and" isn't C, so I assume you were very lucky <wink>.
#define IF_OVERFLOW_GOTO(X) __asm { jo X };
#else
#define IF_OVERFLOW_GOTO(X) if ((i^a) < 0 && (i^b) < 0) goto X;
#endif
Would this case be an acceptable use of snippets of inline assembler?
If you had said "a huge speedup, on all programs", on the weak end of maybe. "Small speedup" isn't worth the obscurity. Note that Python contains no assembler now.

Tim Peters wrote:
If you had said "a huge speedup, on all programs", on the weak end of maybe. "Small speedup" isn't worth the obscurity. Note that Python contains no assembler now.
Just to add my 0.02 EUR. You know that I'm not reluctant to use assembly for platform-specific speedups. But first, I'm with Tim: I would not go down this path for such a small win.

Second, I'd like to point out that going to assembly in a function as huge as eval_frame is rather dangerous: all compilers have different ways of handling the appearance of assembly. This is a dangerous path, believe me. MS C's behavior is one of the worst, which is the reason why I was very careful to put this in a clean-room for Stackless, for instance: when ASM code appears in some function, the calling sequence and the optimization strategy are changed drastically. Register allocation is changed, the optimization level is reduced, and the calling convention is *never* without stack frames.

This might not have changed eval_frame's behavior too much, just because it is too big to benefit from certain optimizations now, but I remember that I changed it once to use about two fewer registers, and I might re-apply these changes to give the eval loop a boost of about 10 percent. The existence of a single asm statement would void this effect!

Hint: Write a small, understandable function twice, once using assembly and once without. Compile the stuff, and set the listing option to everything. Then look at the .cod file, and wonder how different the two versions are. This will make you very reluctant to use any asm statement at all, unless you want to rewrite the whole function in assembly, including the "naked" option.

Doing the latter for eval_frame would be worthwhile, but then I'd suggest doing this as an external .asm file. If you do this right, taking cache lines and probabilities into account, you can for sure create an overall gain of up to 20 percent. But even this remarkable gain wouldn't be enough, even for me, to go this hard path for a single platform.

sincerely -- chris

--
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
     whom do you want to sponsor today?   http://www.stackless.com/

-----Original Message-----
From: Tim Peters [mailto:tim.one@comcast.net]
Sent: Saturday, 8 March 2003 22:42
To: Damien Morton
Cc: python-dev@python.org
Subject: RE: [Python-Dev] acceptability of asm in python code?
[Damien Morton]
In the BINARY_ADD opcode, and in most arithmetic opcodes,
Aren't add and subtract the whole story here?
ADD, SUBTRACT and INPLACE variants, yes. Potentially also MULTIPLY.
there is a line that checks for overflow that looks like this:
if ((i^a) < 0 && (i^b) < 0) goto slow_add;
I got a small speedup by replacing this with a macro defined thusly:
#if defined(_MSC_VER) and defined(_M_IX86)
"and" isn't C, so I assume you were very lucky <wink>.
I had been using _MSC_VER, but decided to be a bit more specific for my post. You're right, of course, the define I posted would not have worked.
#define IF_OVERFLOW_GOTO(X) __asm { jo X };
#else
#define IF_OVERFLOW_GOTO(X) if ((i^a) < 0 && (i^b) < 0) goto X;
#endif
Would this case be an acceptable use of snippets of inline assembler?
If you had said "a huge speedup, on all programs", on the weak end of maybe. "Small speedup" isn't worth the obscurity. Note that Python contains no assembler now.
It's arguable which is more obscure, the x86 assembly instruction "jo" (jump if overflow), or the xor trickery in C. <wink> I take your point, though, about there being no assembly in Python now.
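As a sketch of the correction conceded above: in a C #if directive the conjunction is spelled "&&", not "and". The example below fixes the guard and uses only the portable branch of the posted macro. The __asm { jo X } branch is deliberately left out, since whether it picks up the right overflow flag is exactly what the thread questions. ON_MSVC_X86 and fast_add are hypothetical names invented for this illustration.

```c
/* Corrected guard: "&&" is the preprocessor's conjunction operator. */
#if defined(_MSC_VER) && defined(_M_IX86)
#define ON_MSVC_X86 1   /* hypothetical flag: where the asm variant would go */
#else
#define ON_MSVC_X86 0
#endif

/* Portable branch of the posted macro. It expects locals named i, a and b,
 * with i already holding the wrapped sum of a and b. */
#define IF_OVERFLOW_GOTO(X) if ((i^a) < 0 && (i^b) < 0) goto X

/* Hypothetical stand-in for the fast path of an add opcode: returns 1 and
 * stores the sum on success, or 0 when the slow path would be needed. */
int fast_add(long a, long b, long *result)
{
    /* Wrapping add via unsigned arithmetic (signed overflow is undefined). */
    long i = (long)((unsigned long)a + (unsigned long)b);
    IF_OVERFLOW_GOTO(slow_add);
    *result = i;
    return 1;
slow_add:
    return 0;
}
```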

In article <000501c2e5f8$c384b6e0$6401a8c0@damien>, "damien morton" <dmorton@bitfurnace.com> wrote:
If you had said "a huge speedup, on all programs", on the weak end of maybe. "Small speedup" isn't worth the obscurity. Note that Python contains no assembler now.
It's arguable which is more obscure, the x86 assembly instruction "jo" (jump if overflow), or the xor trickery in C. <wink>
I take your point, though, about there being no assembly in python now.
The place to put this sort of low-level instruction optimization is in the peepholer of your C compiler.
--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
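One way to read this suggestion with compilers that postdate the thread: GCC (5 and later) and Clang provide __builtin_add_overflow, which leaves it to the compiler to pick the instruction sequence, typically an add followed by a jump on the overflow flag on x86. The sketch below is an assumption about how that could be used, not something proposed in the thread. checked_add and HAVE_ADD_OVERFLOW_BUILTIN are hypothetical names, and the fallback is the xor check from the original post.

```c
/* Detect the builtin: Clang and newer GCC offer __has_builtin,
 * older GCC releases are recognised by version number. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_add_overflow)
#    define HAVE_ADD_OVERFLOW_BUILTIN 1
#  endif
#elif defined(__GNUC__) && __GNUC__ >= 5
#  define HAVE_ADD_OVERFLOW_BUILTIN 1
#endif

/* Hypothetical helper: returns 1 on overflow, 0 otherwise.
 * The wrapped sum is stored in *sum in both cases. */
int checked_add(long a, long b, long *sum)
{
#ifdef HAVE_ADD_OVERFLOW_BUILTIN
    /* The compiler chooses the machine instructions. */
    return __builtin_add_overflow(a, b, sum);
#else
    /* Portable fallback: wrapping add plus the sign comparison. */
    long i = (long)((unsigned long)a + (unsigned long)b);
    *sum = i;
    return (i ^ a) < 0 && (i ^ b) < 0;
#endif
}
```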

[damien morton]
It's arguable which is more obscure, the x86 assembly instruction "jo" (jump if overflow), or the xor trickery in C. <wink>
It's not just the assembler, it's also the world of delicate assumptions about how the compiler interleaves generated C code with the forced inline assembler, how that affects optimization in general (see Chris Tismer's post about that), and how brittle that all is. One example of the latter: an idea that resurfaces from time to time is to make Python "short ints" the platform spelling of a 64-bit int. The C overflow-checking code wouldn't be affected by that (part of the reason it's obscure is that it makes no assumption about the size of a Python int). With the inline assembler, though, it would just break -- jo would pick up some accidental setting of the overflow flag under MSVC, or we'd have to arrange to generate __int64 addition code that set the flag the way the macro expects. For a little speedup on the sole operation(s) it targets, it's just not worth the ongoing puzzles. BTW, I'm not sure it's possible to buy a PC anymore less than twice as fast as the one I'm using right now <wink>.
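To illustrate the point that the C check makes no assumption about the size of the integer, here is a small sketch that instantiates the same expression for a 32-bit and a 64-bit type. The macro and function names are hypothetical. The fixed-width types from <stdint.h> are used only to make the two widths explicit and are not what ceval.c uses.

```c
#include <stdint.h>

/* One body, two widths. T is the signed type and UT its unsigned
 * counterpart, used so that the addition wraps instead of overflowing. */
#define DEFINE_ADD_OVERFLOWS(NAME, T, UT)                          \
    int NAME(T a, T b)                                             \
    {                                                              \
        T i = (T)((UT)a + (UT)b);          /* wrapping add */      \
        return (i ^ a) < 0 && (i ^ b) < 0; /* same check as above */ \
    }

DEFINE_ADD_OVERFLOWS(add_overflows_32, int32_t, uint32_t)
DEFINE_ADD_OVERFLOWS(add_overflows_64, int64_t, uint64_t)
```

The check itself is unchanged between the two instantiations. A version keyed to the processor's overflow flag would instead depend on which add instruction the compiler happened to emit for the wider type.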
I take your point, though, about there being no assembly in python now.
There's one place I wish there were: I wish Jeremy had time to fold in his bit of assembler to read the Pentium's clock register. That's a wonderful facility we can't get at now, and the assembler would be limited to a tiny and isolated function.
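A minimal sketch of what such a tiny, isolated function might look like. This is not Jeremy's code, which is not shown in the thread. It assumes GCC-style extended inline assembler on x86 and returns 0 elsewhere, and the names read_cycle_counter and HAVE_CYCLE_COUNTER are hypothetical. The register in question is the time-stamp counter, read with the RDTSC instruction.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_CYCLE_COUNTER 1

/* Reads the processor's time-stamp counter. All assembler is confined
 * to this one small function. */
uint64_t read_cycle_counter(void)
{
    uint32_t lo, hi;
    /* RDTSC places the low 32 bits in EAX and the high 32 bits in EDX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#else
#define HAVE_CYCLE_COUNTER 0

/* Fallback for other compilers and architectures: no counter available. */
uint64_t read_cycle_counter(void)
{
    return 0;
}
#endif
```

Because the assembler sits in its own function, the optimizer's treatment of a large function such as eval_frame is not affected by it.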
participants (5)
- Christian Tismer
- damien morton
- Damien Morton
- David Eppstein
- Tim Peters