[Patches] [Patch #100893] new EXTENDED_ARG opcode, eliminate 16-bit oparg limit

noreply@sourceforge.net
Wed, 9 Aug 2000 04:36:09 -0700


Patch #100893 has been updated. 

Project: 
Category: core (C code)
Status: Rejected
Summary: new EXTENDED_ARG opcode, eliminate 16-bit oparg limit

Follow-Ups:

Date: 2000-Jul-15 05:24
By: twouters

Comment:
Can you elaborate on the point of this patch? It removes the current 32k expression limit, but the cost is pretty high. Is the 32k limit really a problem? Would a 64k limit, obtained by fixing the current signed-byte problems, suffice? Also, the patch seems to rely on the fact that ints are always 32 bits. That is bound to be an unsafe assumption in the long run.


-------------------------------------------------------

Date: 2000-Jul-15 18:32
By: cgw

Comment:
I don't think it's really true that the cost of this patch is
high.  I did timing tests on 3 different machines; my (admittedly
somewhat unscientific) test was to run "python pystone.py" 100x
and collect statistics on the results, for both the original and
patched Python interpreters.  Values cited are PyStones.
 
Here are the results:
=============================================
On an SGI IRIX64 6.5 IP27
 built with native compiler MIPSpro Compilers: Version 7.2.1
 Unpatched:  N=100 mean=2638.05 std dev=42.8
 Patched:    N=100 mean=2656.19 std dev=14.8
 Difference: +0.7%

 built with gcc version 2.95.2 19991024 (release)
 Unpatched:  N=100 mean=2171.77 std dev=8.69
 Patched:    N=100 mean=2192.73 std dev=9.80
 Difference: +1%
=============================================
On a SunOS 5.6 sun4u sparc Ultra-Enterprise
 built with native compiler WorkShop Compilers 4.2
 Unpatched:  N=100 mean=1962.32 std dev=29.79
 Patched:    N=100 mean=1913.84 std dev=8.705
 Difference: -2.5%

 built with gcc version 2.95.2 19991024 (release)
 Unpatched:  N=100 mean=1859.08 std dev=11.68
 Patched:    N=100 mean=1939.78 std dev=11.97
 Difference: +4.3%
=============================================
On Linux 2.2.16 SMP
 built with gcc version 2.95.2 19991024 (release)
 Unpatched:  N=100 mean=4498.78 std dev=102.61
 Patched:    N=100 mean=4592.40 std dev=102.38
 Difference: +2%
 
I think that looking ahead in the instruction stream is not costly,
because the bytecode at instruction n+2 is probably already in the
CPU's level 1 or level 2 cache; if not, "prefetching" this instruction
does not have any adverse effects on performance, because the
instruction will then be available when it is needed (which will
usually be almost immediately).  In many cases, actually, the code
with my patch runs a bit faster.  I've also reduced the amount of
arithmetic required in the case of little-endian machines - rather
than bit-shifting and adding to get a short out of two bytes, I simply
use a pointer-cast.  Anyhow, try the patch out; I think the speed
difference is negligible.
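
For illustration only (this is not code from the patch): a minimal
sketch of the two ways of reading a 16-bit argument compared above,
assuming a ceval.c-style "unsigned char *next_instr" instruction
pointer.  The macro names are invented for the example.

    #include <stdio.h>

    /* Portable form: combine two bytes with a shift and an add. */
    #define NEXTARG_PORTABLE() \
        (next_instr += 2, (next_instr[-1] << 8) + next_instr[-2])

    /* Little-endian shortcut: reinterpret the two bytes as a short.
       Assumes sizeof(short) == 2, little-endian byte order, and that
       unaligned 2-byte reads are harmless. */
    #define NEXTARG_CAST() \
        (next_instr += 2, *(unsigned short *)(next_instr - 2))

    int main(void)
    {
        unsigned char code[] = { 0x34, 0x12, 0x34, 0x12 };
        unsigned char *next_instr = code;
        printf("portable: 0x%x\n", NEXTARG_PORTABLE());  /* 0x1234 */
        printf("cast:     0x%x\n", NEXTARG_CAST());      /* 0x1234 on LE */
        return 0;
    }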

To answer your other points - yes, I think the 32K limit is
a problem - I've seen cases where machine-generated files are longer than this.  And I'm not too happy with the idea
of replacing a 32K limit with a 64K limit.  

Finally, I don't see where I'm assuming that ints are
always 32 bits - I assume that a long is at least 32
bits, but unless I'm missing something this code will work
just fine on 64-bit machines.

Feedback is of course always welcome....


-------------------------------------------------------

Date: 2000-Jul-16 02:16
By: twouters

Comment:
I wasn't talking so much about speed cost as about complexity cost, but it's good to see that the speed cost is negligible. I'm a bit worried that the extra logic in NEXTARG() is too complex, but then I've never come close to the 32k limit, and I don't see a better solution short of rewriting the entire instruction-stream handling. Isn't it easier to adjust the code-generator to add more groupings? I'm not sure that will work in all cases, though, as I don't have such a code-generator ;)

The reason the patch is `dangerous' is that you rely on ints being small enough to represent in 4 bytes: you removed the overflow check, which means that on 64-bit machines, if you have more than 2**31 child nodes, you'll generate faulty bytecode.

I'd say Tim or Guido needs to take a look at this, and I guess they're too busy packing (or, in Tim's case, unpacking stuff he doesn't want to take ;) for ORA's OSCON.


-------------------------------------------------------

Date: 2000-Jul-16 16:16
By: cgw

Comment:
It looks like I took out all the overflow checking, but in
fact the code is safe.  com_addbyte in compile.c checks
that its (int) argument is in the range 0<=x<=255.

The checks against SHRT_MAX that my patch removes were only added
recently, and are unnecessary if EXTENDED_ARG is available.
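
For reference, a simplified, stand-alone sketch of the kind of 0..255
range check being described (the real com_addbyte() lives in compile.c
and reports errors through the compiler's own error machinery; the
function below is only an illustration):

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative stand-in for com_addbyte(): every byte appended to
       the code stream must already fit in 0..255, so an argument that
       overflows 8 bits cannot slip through silently. */
    static void
    add_byte_checked(unsigned char *codestr, int *len, int byte)
    {
        if (byte < 0 || byte > 255) {
            fprintf(stderr, "byte argument out of range: %d\n", byte);
            abort();
        }
        codestr[(*len)++] = (unsigned char)byte;
    }

    int main(void)
    {
        unsigned char code[16];
        int len = 0;
        add_byte_checked(code, &len, 100);   /* fine */
        add_byte_checked(code, &len, 300);   /* caught by the check */
        return 0;
    }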


-------------------------------------------------------

Date: 2000-Jul-27 07:18
By: gvanrossum

Comment:
This will be too slow -- it adds a lot of complexity to the decoding of *each* instruction, and that's the most time-sensitive code in all of ceval.c. I would consider a patch that introduces a new prefix opcode specifying the high two bytes of oparg, repeats the opcode decoding inline, and then jumps to the top of the switch -- this should only cost anything when 32-bit opargs are needed (plus a little bit due to code rearrangements, but that's unavoidable).  Pseudo-code:

label:
  switch (opcode) {
  case LONG_ARG:
    opcode = NEXTOP();              /* fetch the real opcode */
    oparg = oparg<<16 | NEXTARG();  /* oparg held the high 16 bits */
    goto label;
  ...other cases as before...
  }

Thus, for a REAL_OP with a 32-bit oparg, the code generator should generate code like:

  LONG_ARG <high 16 bits of oparg>
  <REAL_OP> <low 16 bits of oparg>
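
Below is a self-contained sketch of the dispatch pattern described
above; it is my illustration, not code from the patch.  The opcode
numbers, the HAS_ARG() convention, and the macros are modeled loosely
on ceval.c and opcode.h, with LONG_ARG standing in for the proposed
prefix opcode.

    #include <stdio.h>

    #define STOP_CODE     0
    #define HAVE_ARGUMENT 90      /* opcodes >= this take a 16-bit arg */
    #define LONG_ARG      90      /* prefix carrying the high 16 bits */
    #define LOAD_CONST    100     /* stand-in for any "real" opcode */
    #define HAS_ARG(op)   ((op) >= HAVE_ARGUMENT)

    #define NEXTOP()  (*next_instr++)
    #define NEXTARG() (next_instr += 2, (next_instr[-1] << 8) + next_instr[-2])

    static void
    run(unsigned char *next_instr)
    {
        for (;;) {
            int opcode = NEXTOP();
            int oparg = 0;
            if (HAS_ARG(opcode))
                oparg = NEXTARG();
        dispatch:
            switch (opcode) {
            case LONG_ARG:
                /* oparg holds the high 16 bits; fetch the real opcode,
                   fold in its low 16 bits, and re-dispatch. */
                opcode = NEXTOP();
                oparg = oparg << 16 | NEXTARG();
                goto dispatch;
            case LOAD_CONST:
                printf("LOAD_CONST %#x\n", (unsigned)oparg);
                break;
            case STOP_CODE:
            default:
                return;
            }
        }
    }

    int main(void)
    {
        /* LONG_ARG 0x0001; LOAD_CONST 0x0002  ->  LOAD_CONST 0x10002 */
        unsigned char code[] = { LONG_ARG, 0x01, 0x00,
                                 LOAD_CONST, 0x02, 0x00, STOP_CODE };
        run(code);
        return 0;
    }

With this layout the LONG_ARG case is the only place that pays for
32-bit arguments; the ordinary decode at the top of the loop is
unchanged.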
-------------------------------------------------------

Date: 2000-Jul-27 22:50
By: tim_one

Comment:
Sorry, guys -- there was so much email today that I didn't even see this until now.

Yes, I saw the timing results, and wasn't surprised.  As the one one-time professional optimization guy <wink> hanging out on c.l.py, I get sucked into these things a lot.  Across architectures and compilers, there's no way to predict whether a small change will speed up or slow down.  And ceval.c pushes current combos to their limits:  Marc-Andre once put an *unexecuted* printf into the main loop and measured a ~15% slowdown on his combo as a result.  On my combo, it appeared to yield a slight speedup.

That doesn't mean it's hopeless, though!  Over time, the length of the critical path through the code is the best predictor of how well compilers and architectures will eventually perform.  Reduce the operation count on the critical path, and you almost always win in the end; increase it, and you almost always lose.  That's my real objection to the original patch:  instruction decode *is* on the critical path, and sticking another test+branch in there is a long-term loser no matter what tests say today on a handful of combos.  Optimizers get better over time, and architectures reward simpler code over time.

Greg Ewing had another interesting suggestion on c.l.py today:  don't even fetch the second byte of 2-byte instructions at the top of the loop; wait until you're in the cases that actually need the next byte.  The amount of code duplication in that is unattractive, though, and the switch in ceval is definitely fat enough that 2nd-order effects like instruction cache behavior can make a real difference (indeed, that's what I believe was going on with Marc-Andre's passive printf).

BTW, you cannot assume that a short is 2 bytes!  Python runs on machines where sizeof(short) == 8.

-------------------------------------------------------

Date: 2000-Aug-02 14:57
By: cgw

Comment:
Here's a completely reworked version of the EXTENDED_ARG patch which inserts the EXTENDED_ARG bytecode before the opcode it modifies, rather than after it.  This avoids adding extra glop to the critical section of instruction decoding.
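
For concreteness, a sketch of the emission order this implies on the
compiler side (my illustration only; emit_op() and the EXTENDED_ARG
value are hypothetical stand-ins for compile.c's real byte-emitting
helpers and opcode numbering):

    #include <stdio.h>

    #define EXTENDED_ARG 143    /* illustrative value only */

    /* Hypothetical stand-in for the compiler's byte-emitting helpers;
       it just shows what would be appended to the code stream. */
    static void
    emit_op(int opcode, int arg)
    {
        printf("emit opcode %3d  arg 0x%04x\n", opcode, arg);
    }

    /* If an argument no longer fits in 16 bits, emit EXTENDED_ARG with
       the high 16 bits *before* the real opcode, which then carries the
       low 16 bits; the interpreter's fast path never changes. */
    static void
    emit_with_arg(int opcode, unsigned long arg)
    {
        if (arg > 0xFFFF)
            emit_op(EXTENDED_ARG, (int)((arg >> 16) & 0xFFFF));
        emit_op(opcode, (int)(arg & 0xFFFF));
    }

    int main(void)
    {
        emit_with_arg(100, 0x12345UL);   /* a 17-bit argument */
        return 0;
    }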

-------------------------------------------------------

Date: 2000-Aug-08 22:26
By: tim_one

Comment:
Changed status to Rejected, but left assigned to me (I'll Open it again when it's updated).  As I said at the bottom of my last comment on this, you cannot assume sizeof(short)==2:  please get rid of the new big-endian/little-endian trickery.  The code is not portable as-is.  Even on machines where sizeof(short) does equal 2, shorts in the opcode stream are not guaranteed to be naturally aligned, and trying to access them as if they were can lead to anything from a core dump to an expensive trap to the OS to fix up the unaligned read.  And on the only machines where this gimmick could actually pay (a little-endian machine where sizeof(short)==2 *and* the HW doesn't penalize unnatural alignment), the compiler should be smart enough to optimize the old expression for us.
-------------------------------------------------------

Date: 2000-Aug-09 04:36
By: gvanrossum

Comment:
I think that all Tim wants is that you undo the changes you made to the NEXTARG() macro.
-------------------------------------------------------

-------------------------------------------------------
For more info, visit:

http://sourceforge.net/patch/?func=detailpatch&patch_id=100893&group_id=5470