[Python-Dev] Python-acceleration instructions on ARM

Wed Feb 11 08:27:34 CET 2009

>     ARM is specifically claiming that these instructions can be used to
>     accelerate Python interpretation.
> 
> 
> Wow, really? One of the links below mention that?

I'm skeptical, though, that you can really produce speedups for CPython;
ISTM that they added Python only as a front-end language for
Parrot, and added Parrot only because it looks similar to JVM and .NET
(i.e. without actually testing that you can gain performance).

From reading the paper, ISTM that you *can* expect speedups for your
JIT-generated code. In ThumbEE, you have the following additional
features:

- fast null pointer checks: any register-indirect addressing in ThumbEE
  mode checks whether the base register is NULL; if it is, a callback
  is invoked (which could then throw NullPointerException). This is
  irrelevant in Python, because we don't use NULL as the value for "no
  object"
- fast array bounds check: there is an instruction that checks
  whether 0 <= Rm <= Rn, and invokes a callback if it's not; this
  would then throw ArrayIndexOutOfBoundsException. This instruction
  would be emitted by the JIT just before any array access. In Python,
  you cannot easily JIT array access into a direct machine instruction
  (as you need to go through tp_as_sequence->sq_item); the cost of the
  array bounds check would likely disappear in white noise.
- fast switch instruction: there is an efficient way to switch among
  256 different byte code operations, with an optional immediate
  parameter. It will call/jump to 256 byte code handlers. This allows
  for a straightforward JIT compiler which essentially compiles all
  byte codes into such switch instructions. That would work for Python
  as well, but would require that ceval be rewritten entirely.
- fast locals: efficient access to a local-variables array, for
  JIT generation of ldloc (in .NET; not sure what the Java
  byte code for local variables is). Would work as well for Python,
  assuming there is a JIT compiler in the first place. R9 holds
  the fastlocals pointer (which is a good use of the register, since
  you cannot access it in Thumb mode, anyway)
- fast instance variables: likewise, with R10 holding the this
  pointer. Not applicable to Python, since there is no byte code
  for instance variable access.
- efficient array indexing: they give shift-and-index back to
  Thumb mode, for a shift by 2, allowing arrays with 4-byte
  elements to be indexed in a single instruction (rather than
  requiring a separate multiply-by-four). Again useful for JIT of
  array access instructions, not applicable to Python, although it
  would be nice if the C compiler knew how to emit that.
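
To make the switch-instruction point a bit more concrete, here is a toy
sketch in Python of what "one handler per opcode byte, with an optional
immediate" looks like as a table lookup. The opcode numbers, handler
names and the run() function are all made up for illustration; this is
not how ceval or the ThumbEE handler table is actually laid out, just
the general shape of the dispatch.

```python
# Toy illustration only: opcode numbers and handler names are hypothetical,
# not CPython's byte codes and not ThumbEE's handler layout.
PUSH, ADD, MUL, HALT = 0, 1, 2, 3


def op_push(stack, arg):
    # The immediate parameter is the value to push.
    stack.append(arg)


def op_add(stack, arg):
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)


def op_mul(stack, arg):
    right = stack.pop()
    left = stack.pop()
    stack.append(left * right)


def op_invalid(stack, arg):
    raise ValueError("invalid opcode")


# One handler slot per possible opcode byte, 256 in total.
HANDLERS = [op_invalid] * 256
HANDLERS[PUSH] = op_push
HANDLERS[ADD] = op_add
HANDLERS[MUL] = op_mul


def run(code):
    """Execute (opcode, immediate) pairs until HALT; return the top of stack."""
    stack = []
    for opcode, arg in code:
        if opcode == HALT:
            break
        # The dispatch itself: index the table with the opcode byte.
        HANDLERS[opcode](stack, arg)
    return stack[-1]


# (2 + 3) * 4
result = run([(PUSH, 2), (PUSH, 3), (ADD, 0), (PUSH, 4), (MUL, 0), (HALT, 0)])
print(result)
```

In hardware terms, the table lookup plus call is what the switch
instruction would collapse into a single step; everything inside the
handlers stays as expensive as before.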

Regards,
Martin