Concerns about performance w/Python, Psyco on Pentiums

Thu Mar 6 14:35:14 EST 2003

Michael Hudson wrote:
> 
> Peter Hansen <peter at engcorp.com> writes:
> 
> > Michael Hudson wrote:
> > >
> > > You might also try a list of functions rather than a dict.  I think
> > > psyco knows more about lists than dicts.  But ICBW.
> >
> > Ah, thanks.  My lateral-thinking cap must have been off.  Although
> > dicts are fast in Python in general, in this case the items in the
> > list are tightly constrained (opcodes, from 0 to 255) and a list
> > is quite feasible.  I'll give it a shot - thanks! :)
> 
> In which case, I'd expect lists to be faster than dicts w/o psyco too,
> 'cause of the special casing for list[int] in ceval.c.

Interesting...  I tried the (one-line!) change to use a list 
instead of a dict...  below are some results.  For reference, for those 
still interested, here are a few snippets showing sample code, with some
extraneous stuff removed for clarity:

------------------

class Opcode:
    opcodes = {}   # or use [None] * 256 to store in a list

    def __init__(self, code, name, length, cycles, mode):
        self.code = code
        self.name = name
        self.length = length
        self.execute = globals()['execute' + self.name]
        self.cycles = cycles
        self.mode = mode

        self.opcodes[self.code] = self

# example opcode: "no operation"
def executeNOP(cpu):
    pass

# example opcode: "load D register"
def executeLDD(cpu):
    cpu.setD(cpu.readUword(cpu.effectiveAddress))
    cpu.CCR_N = bool(cpu.D & 0x8000)
    cpu.CCR_Z = (cpu.D == 0)

# create some opcodes, storing references in class
Opcode(0xA7, 'NOP', 1, 1, 'INH')
Opcode(0xCC, 'LDD', 3, 2, 'IMM')

class Cpu:
   ....
    def __init__(self, name):
        self.name = name

        self.setCCR('sxhinzvc')
        self.D = 0
        self.A = 0
        self.B = 0
        self.X = 0
        self.Y = 0
        self.SP = 0
        self.PC = 0
        self.memory = [0] * 65536
        self.opcodes = Opcode.opcodes

    def step(self):
        opcodeByte = self.readByte(self.PC)
        try:
            opcode = self.opcodes[opcodeByte]
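        except KeyError:
            # dict variant only: with a [None] * 256 list, an unimplemented
            # opcode comes back as None rather than raising KeyError
            raise UnimplementedOpcode('$%02X %s' % (
                opcodeByte, self.readMemory(self.PC + 1, 5)))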
        else:
            deltaPC, self.effectiveAddress = self.resolveAddress(opcode.mode)
            newPC = opcode.execute(self)

            if newPC is not None:
                self.PC = newPC
            else:
                self.PC += deltaPC + opcode.length

            self.cycles += opcode.cycles
   ...

Summarizing, opcodes are instantiated objects tracked in a class
variable in the Opcode class.  They are dispatched through their
execute attribute, which is bound on instantiation to the global
function of the appropriate name.  The Cpu class, instantiated 
once, has a step() method that is called repeatedly from a run()
method.  It grabs the opcode byte from memory, looks it up in
the single Opcode class list/dict, and calls the execute function,
passing itself as a parameter.
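
For anyone who wants to try the list variant in isolation, here is a
minimal sketch of the dispatch just described.  It is not the actual
emulator: MiniCpu is a hypothetical stand-in for Cpu, address-mode
resolution (deltaPC, effectiveAddress) is left out, and the execute
function is passed in directly rather than looked up through globals().
The explicit None check is an assumption about how one might report an
unimplemented opcode once the table is a list.

```python
class UnimplementedOpcode(Exception):
    """Raised when the fetched byte has no registered Opcode."""


class Opcode:
    # list variant: the opcode byte (0..255) is the index
    opcodes = [None] * 256

    def __init__(self, code, name, length, cycles, execute):
        self.code = code
        self.name = name
        self.length = length
        self.cycles = cycles
        self.execute = execute   # passed in directly (simplification)
        Opcode.opcodes[code] = self


def executeNOP(cpu):
    pass


Opcode(0xA7, 'NOP', 1, 1, executeNOP)


class MiniCpu:
    """Hypothetical cut-down Cpu: fetch, look up, execute, advance."""

    def __init__(self, memory):
        self.memory = memory
        self.PC = 0
        self.cycles = 0
        self.opcodes = Opcode.opcodes

    def step(self):
        opcodeByte = self.memory[self.PC]
        opcode = self.opcodes[opcodeByte]
        if opcode is None:
            # a list lookup of an unused slot does not raise, so test for None
            raise UnimplementedOpcode('$%02X' % opcodeByte)
        newPC = opcode.execute(self)
        if newPC is not None:
            self.PC = newPC
        else:
            self.PC += opcode.length
        self.cycles += opcode.cycles
```

Stepping a MiniCpu over a memory image of NOP bytes advances PC by one
per step, and any byte without a registered Opcode raises
UnimplementedOpcode.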

Here are the results, this time for a P3 730MHz machine.
(Results are repeatable to within about +/- 500 Hz.)

Original code w/dict, no Psyco:   69350 Hz  (A = baseline)
Original code w/dict, use Psyco:  94250 Hz  (B = A + 36%)

Variation with list, no Psyco:    70725 Hz  (C = A + 2%)
Variation with list, use Psyco:   96190 Hz  (D = C + 36%, or A + 39%)

So the switch to use a list provides a minimal (2%) speedup,
while Psyco, properly used (*), manages to speed either approach
up by 36%.
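
The percentages can be rechecked from the four measured rates with a
few lines of arithmetic.  The gain_percent helper is just an
illustrative name; the numbers are the ones reported in this post.

```python
# Measured rates in Hz, as reported in this post
rates = {'A': 69350, 'B': 94250, 'C': 70725, 'D': 96190}


def gain_percent(new, base):
    """Relative speedup of new over base, rounded to a whole percent."""
    return round(100.0 * (new - base) / base)


print('B vs A: +%d%%' % gain_percent(rates['B'], rates['A']))   # dict, Psyco
print('C vs A: +%d%%' % gain_percent(rates['C'], rates['A']))   # list, no Psyco
print('D vs C: +%d%%' % gain_percent(rates['D'], rates['C']))   # list, Psyco
print('D vs A: +%d%%' % gain_percent(rates['D'], rates['A']))   # both changes
```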

* I apparently screwed up the first time I used Psyco, or maybe
it doesn't work nearly as well on a P266MMX, because on this machine
it's doing much better than 12%.  I'm not willing to claim that 
I'm actually using it "correctly" yet, since this is time #2 for me
using Psyco...

Tentative conclusion: although it's at the bottom of the range of 
claimed improvements from Psyco, I'll take my 36% and run.  I'll
switch to lists, because dicts have zero advantages in this case,
though the speedup is minor.
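
A rough way to compare the raw lookup cost, separate from the rest of
the emulator, is a timeit micro-benchmark along these lines.  The table
names and the time_lookup helper are made up for illustration, and the
numbers will vary by machine and interpreter, so treat it as a sketch
rather than a measurement of the code above.

```python
import timeit

# Two lookup tables with the same 256 integer keys
list_table = [None] * 256
dict_table = dict.fromkeys(range(256))


def noop(cpu):
    pass


list_table[0xCC] = noop
dict_table[0xCC] = noop


def time_lookup(table, number=200000):
    """Time `number` lookups of one opcode byte in the given table.

    The lambda call overhead is included in both timings, so only the
    difference between the two results is meaningful.
    """
    return timeit.timeit(lambda: table[0xCC], number=number)


list_seconds = time_lookup(list_table)
dict_seconds = time_lookup(dict_table)
print('list: %.4f s   dict: %.4f s' % (list_seconds, dict_seconds))
```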

I'll take Chris L's research as conclusive about the somewhat 
dubious value of a superfast Pentium 4 chip compared to the lowly
P3 at a lower clock rate, and stop worrying about it.

And I doubt I'll bother playing around any more without 
(a) deciding the thing is "too slow" (which it's not, yet) and 
(b) actually finishing the code, and profiling it as one 
always should before optimizing.  ;-)
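
When the time comes, profiling could look something like the sketch
below, which uses the cProfile and pstats modules from a current Python
standard library.  The busy_step and run_steps functions are
hypothetical placeholders for Cpu.step() and Cpu.run(), not code from
the emulator.

```python
import cProfile
import io
import pstats


def busy_step(state):
    # hypothetical placeholder for Cpu.step()
    state['PC'] = (state['PC'] + 1) & 0xFFFF
    state['cycles'] += 1


def run_steps(count):
    # hypothetical placeholder for Cpu.run()
    state = {'PC': 0, 'cycles': 0}
    for _ in range(count):
        busy_step(state)
    return state


profiler = cProfile.Profile()
profiler.enable()
final_state = run_steps(10000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats('cumulative').print_stats(5)
print(report.getvalue())
```

The report lists call counts and time per function, which shows where
the time actually goes before any guess about dicts versus lists.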

I very much appreciate all the input received to date, and those
responses still to come.

Cheers,
-Peter