[Python-Dev] Python-acceleration instructions on ARM
"Martin v. Löwis"
martin at v.loewis.de
Wed Feb 11 08:27:34 CET 2009
> ARM is specifically claiming that these instructions can be used to
> accelerate Python interpretation.
>
>
> Wow, really? One of the links below mentions that?
I'm skeptical that you can really produce speedups for CPython,
though; ISTM that they added Python only as a front-end language for
Parrot, and added Parrot only because it looks similar to JVM and .NET
(i.e. without actually testing that you can gain performance).
From reading the paper, ISTM that you *can* expect speedups for your
JIT-generated code. In ThumbEE, you have the following additional
features:
- fast null pointer checks: any register-indirect addressing in ThumbEE
mode checks whether the base register is NULL; if it is, a callback
is invoked (which could then throw NullPointerException). This is
irrelevant in Python, because we don't use NULL as the value for "no
object".
- fast array bounds check: there is an instruction that checks
whether 0 <= Rm <= Rn, and invokes a callback if it's not; this
would then throw ArrayIndexOutOfBoundsException. This instruction would
be emitted by the JIT just before any array access. In Python, you cannot
easily JIT array access into a direct machine instruction (as you
need to go through tp_as_sequence->sq_item); the array bounds check
would likely disappear in white noise.
- fast switch instruction: there is an efficient way to switch on 256
different byte code operations, with an optional immediate parameter.
It will call/jump to one of 256 byte code handlers. This allows for a
straightforward JIT compiler which essentially compiles all byte
codes into such switch instructions. That would work for Python as
well, but would require that ceval gets rewritten entirely.
- fast locals: efficient access to a local-variables array, for
JIT generation of ldloc.i4 (in .NET, not sure what the Java
byte code for local variables is). Would work as well for Python,
assuming there is a JIT compiler in the first place. R9 holds
the fastlocals pointer (which is good use of the register, since
you cannot access it in Thumb mode, anyway).
- fast instance variables: likewise, with R10 holding the this
pointer. Not applicable to Python, since there is no byte code
for instance variable access.
- efficient array indexing: they give shift-and-index back to
Thumb mode, for a shift by 2, so that arrays with 4-byte elements
can be indexed in a single instruction (rather than requiring
a separate multiply-by-four). Again useful for JIT of array
access instructions, not applicable to Python - although it
would be nice if the C compiler knew how to emit that.
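
To make the switch, fast-locals and bounds-check points a bit more
concrete, here is a minimal sketch of a toy stack VM in Python. It is
only an illustration of the general shape (a 256-entry handler table
indexed by the opcode byte, a flat local-variables array, and an
explicit bounds check before each array access); the opcode numbers,
handler names and the IndexCheckFailed exception are all made up for
this example and have nothing to do with ceval or with the actual
ThumbEE encoding. The check is written in the conventional
0 <= index < length form, which is an assumption about how a JIT would
set up the operands.

```python
# Toy stack VM; opcode numbers and handler names are hypothetical.
LOAD_CONST, LOAD_FAST, STORE_FAST, INDEX, RETURN = range(5)


class IndexCheckFailed(Exception):
    """Stands in for the callback a failed bounds check would invoke."""


def run(code, consts, nlocals):
    fastlocals = [None] * nlocals  # flat local-variables array
    stack = []

    def load_const(arg):
        stack.append(consts[arg])

    def load_fast(arg):
        stack.append(fastlocals[arg])

    def store_fast(arg):
        fastlocals[arg] = stack.pop()

    def index(arg):
        i = stack.pop()
        seq = stack.pop()
        # Explicit bounds check emitted just before the array access.
        if not 0 <= i < len(seq):
            raise IndexCheckFailed(i)
        stack.append(seq[i])

    def invalid(arg):
        raise ValueError("unknown opcode")

    # 256-entry handler table: one slot per possible opcode byte.
    handlers = [invalid] * 256
    handlers[LOAD_CONST] = load_const
    handlers[LOAD_FAST] = load_fast
    handlers[STORE_FAST] = store_fast
    handlers[INDEX] = index

    pc = 0
    while True:
        # Each instruction is an opcode byte plus one immediate byte.
        opcode, arg = code[pc], code[pc + 1]
        pc += 2
        if opcode == RETURN:
            return stack.pop()
        handlers[opcode](arg)


consts = [[10, 20, 30], 1]
program = bytes([
    LOAD_CONST, 0,   # push the list
    STORE_FAST, 0,   # fastlocals[0] = list
    LOAD_FAST, 0,    # push fastlocals[0]
    LOAD_CONST, 1,   # push the index 1
    INDEX, 0,        # bounds-checked element access
    RETURN, 0,
])
print(run(program, consts, nlocals=1))  # prints 20
```

In pure Python this gains nothing, of course; the point is only that
each of the pieces above (table dispatch, the local-variables load, the
bounds check) is what would map onto a single instruction in the JIT
output.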
Regards,
Martin