[Python-Dev] Bytecode analysis

damien morton dmorton@bitfurnace.com
Wed, 26 Feb 2003 00:59:11 -0500


I implemented LOAD_FAST_n, STORE_FAST_n, LOAD_CONST_n for n < 16

Getting a small 2% improvement in speed
Going from about 21800 PyStones to 22300 PyStones; very hard to get
consistent readings on the PyStones - anyone got any tips on how to get
more consistent results under windows?

Getting a small 3% reduction in .pyc filesizes
os.path 24,929 unmodified
os.path 24,149 with modifications

I sort of cheated on the switch statement to avoid the use of a goto.

		opcode = NEXTOP();
		if (HAS_ARG(opcode))
			oparg = NEXTARG();
		...
		switch (opcode) {
		...
		case LOAD_FAST_14:
		case LOAD_FAST_15: 
			oparg = opcode - LOAD_FAST_0;
		case LOAD_FAST:
			x = GETLOCAL(oparg);
			if (x != NULL) {
				Py_INCREF(x);
			...

I also altered the opcode.h file to use an enum for the opcodes instead
of all those #defines. Much easier to re-arrange things that way. I have
a feeling that most of the speedup (such that it is) comes from that
re-arrangment, which packs the opcodes into a contiguous numeric space.
I suspect that sorting the opcodes by frequency of access might also
have some positive effect. Also, organising the opcodes and the switch
statement so that frequently co-occuring opcodes are adjacent to each
other might also have some positive effect.

> -----Original Message-----
> From: guido@python.org [mailto:guido@python.org] 
> Sent: Tuesday, 25 February 2003 20:25
> To: damien morton
> Cc: python-dev@python.org
> Subject: Re: [Python-Dev] Bytecode analysis
> 
> 
> > As you say, LOAD_FAST is a very frequently occuring 
> instruction, both 
> > statically and dynamically. Reducing it from a 3 byte 
> instruction to a 
> > 1 byte instruction in 97% of (static) cases should be an 
> overall good.
> > 
> > Most of the opcodes I proposed could be added without disturbing 
> > locality of reference.
> > 
> > e.g.
> > 
> > switch (op = *p++) {
> >    ...
> >    case LOAD_FAST:
> >       index = (*p++) + (*p++)<<8
> >       goto LOAD_FAST_MAIN;
> > 	break;
> >    case LOAD_FAST_0:
> >    case LOAD_FAST_1:
> >    case LOAD_FAST_15:
> >       index = op - LOAD_FAST_0
> >       LOAD_FAST_MAIN:
> > 	...
> >       break;
> > 
> >        
> > }
> 
> Good idea.  Can you benchmark this?
> 
> --Guido van Rossum (home page: http://www.python.org/~guido/)
> 
>