Python-acceleration instructions on ARM
Dear Python developers,

Introduction: I am writing from the perspective of Sugar Labs [1], which produces Sugar, a free software project written almost entirely in Python. Sugar is designed to run on small, resource-constrained computers. So far those computers have been mostly x86, but it seems likely to me that we will soon want to run on ARM as well, with the new wave of small ARM laptops [2]. These laptops are likely to run on some variant of the ARM Cortex-A8 CPU core.

The Cortex-A8 chips all contain a set of commands known as ThumbEE or Jazelle RCT (Runtime Compilation Target) [3]. According to ARM [4]:

"""Jazelle RCT can be used to significantly reduce the code bloat associated with AOT and JIT compilation, making AOT technology viable on mass-market devices. It can also be used to support execution environments beyond Java, such as Microsoft .NET Compact Framework, Python and others."""

"""Jazelle RCT provides an excellent target for any run-time compilation technology, including JIT and AOT for .NET MSIL, Python and Perl as well as Java. ARM is working with leading software providers to enable solutions ready for market with Jazelle RCT."""

The Jazelle RCT system consists of 12 assembly instructions, documented at [5] and [6].

Question: ARM is specifically claiming that these instructions can be used to accelerate Python interpretation. Is there any interpreter code, in CPython or elsewhere, that uses ThumbEE mode? Is there anyone working on this? What would the process be to incorporate the use of ThumbEE instructions into CPython?

The whitepaper mentions the Parrot interpreter specifically, but I cannot find any indication that anyone is actually working on Jazelle RCT support in Parrot.

Thank you,
Ben Schwartz

[1] http://sugarlabs.org/go/Main_Page
[2] http://www.engadget.com/2009/01/09/pegatron-and-freescale-team-for-low-power...
[3] http://en.wikipedia.org/wiki/ARM_architecture#Thumb_Execution_Environment_.2...
[4] http://www.arm.com/products/multimedia/java/jazelle_architecture.html
[5] http://infocenter.arm.com/help/topic/com.arm.doc.dui0379a/CIHBCDGA.html
[6] http://www.arm.com/pdfs/JazelleRCTWhitePaper_final1-0_.pdf

--
View this message in context: http://www.nabble.com/Python-acceleration-instructions-on-ARM-tp21947336p219...
Sent from the Python - python-dev mailing list archive at Nabble.com.
On Tue, Feb 10, 2009 at 18:45, Benjamin Schwartz <bmschwar@fas.harvard.edu> wrote:
Question: ARM is specifically claiming that these instructions can be used to accelerate Python interpretation.
Wow, really? One of the links below mentions that?
Is there any interpreter code, in CPython or elsewhere, that uses ThumbEE mode?
Nope.
Is there anyone working on this?
Not that has contacted us.
What would the process be to incorporate the use of ThumbEE instructions into CPython?
Well, this all depends on how you try to integrate the instructions. If you hide it behind a macro, or integrate it in a clean way that does not penalize builds that skip the instructions, then you write a patch. But if this can't be done, it would be better to maintain an external set of patches against trunk for this. This might be pushed anyway as we have slowly been shying away from platform/CPU-specific code being in the trunk, especially when it does not come from someone who has been a Python core developer for several years.

-Brett
Brett Cannon wrote:
On Tue, Feb 10, 2009 at 18:45, Benjamin Schwartz <bmschwar@fas.harvard.edu> wrote:
...
According to ARM [4]:
"""Jazelle RCT can be used to significantly reduce the code bloat associated with AOT and JIT compilation, making AOT technology viable on mass-market devices. It can also be used to support execution environments beyond Java, such as Microsoft .NET Compact Framework, Python and others."""
"""Jazelle RCT provides an excellent target for any run-time compilation technology, including JIT and AOT for .NET MSIL, Python and Perl as well as Java. ARM is working with leading software providers to enable solutions ready for market with Jazelle RCT."""
...
Question: ARM is specifically claiming that these instructions can be used to accelerate Python interpretation.
Wow, really? One of the links below mentions that?
Yes. The quotes above from [4], as well as the white paper [6]. No specific data, just these broad claims.
What would the process be to incorporate the use of ThumbEE instructions into CPython?
Well, this all depends on how you try to integrate the instructions. If you hide it behind the macro or in a clean way that does not penalize skipping the instructions then you write a patch. But if this can't be done it would be better to maintain an external set of patches against trunk for this.
Interesting. Sugar Labs will probably not attempt this if we would have to maintain a patched interpreter forever. However, I hope it will be possible to integrate into CPython in a manner that does not uglify the code or affect other architectures. Anyone else interested in ARM? ThumbEE support would benefit anyone running Python on recent ARM chips. Maybe we need to create a working group/project team/whatever.
[4] http://www.arm.com/products/multimedia/java/jazelle_architecture.html
[6] http://www.arm.com/pdfs/JazelleRCTWhitePaper_final1-0_.pdf
On Feb, 11 2009 at 04:11:AM, Benjamin M. Schwartz <bmschwar@fas.harvard.edu> wrote:
Interesting. Sugar Labs will probably not attempt this if we would have to maintain a patched interpreter forever. However, I hope it will be possible to integrate into CPython in a manner that does not uglify the code or affect other architectures.
Anyone else interested in ARM? ThumbEE support would benefit anyone running Python on recent ARM chips. Maybe we need to create a working group/project team/whatever.
It's not useful for CPython, since CPython is based on a loop that evaluates one bytecode at a time. You would have to rewrite the virtual machine to implement a JIT compiler that generates Thumb-EE instructions. But that's a big effort, since ceval.c works in a completely different manner.

I don't know if a form of JIT will be implemented in future CPython versions, but if a step in this direction is made, writing a back-end that uses Thumb-EE will be much easier.

Cheers, Cesare
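To make the "loop that evaluates one bytecode at a time" concrete, here is a minimal sketch of such a loop for a toy stack machine. The opcodes, their numbering, and the `run` function are invented for illustration and are not CPython's; the point is only that the fetch, decode, and dispatch steps are repeated for every instruction, which is the structure a JIT back-end would replace rather than accelerate.

```python
# Toy stack machine. Opcode names and numbers are hypothetical,
# chosen for illustration; they are not CPython's opcodes.
PUSH, ADD, MUL, HALT = range(4)


def run(code):
    """Evaluate `code` one instruction at a time, like an interpreter loop."""
    stack = []
    pc = 0
    while True:
        op = code[pc]            # fetch the next opcode
        pc += 1
        if op == PUSH:           # decode and dispatch, repeated every time
            stack.append(code[pc])   # operand follows the opcode
            pc += 1
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError("unknown opcode: %r" % (op,))


# (2 + 3) * 4
program = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
result = run(program)
```

Nothing in this loop emits machine code, so instructions aimed at JIT-generated code have nothing to attach to.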
Cesare Di Mauro wrote:
It's not useful for CPython, since CPython is based on a loop that evaluates one bytecode at a time.
You would have to rewrite the virtual machine to implement a JIT compiler that generates Thumb-EE instructions. But that's a big effort, since ceval.c works in a completely different manner.
I don't know if a form of JIT will be implemented in future CPython versions, but if a step in this direction is made, writing a back-end that uses Thumb-EE will be much easier.
It is beginning to sound like PyPy may be a more appropriate platform for this experimentation than CPython.

Cheers, Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
ARM is specifically claiming that these instructions can be used to accelerate Python interpretation.
Wow, really? One of the links below mentions that?
I'm skeptical, though, that you can really produce speedups for CPython; ISTM that they added Python only as a front-end language for Parrot, and added Parrot only because it looks similar to JVM and .NET (i.e. without actually testing that you can gain performance).
From reading the paper, ISTM that you *can* expect speedups for your JIT-generated code. In ThumbEE, you have the following additional features:
- fast null pointer checks: any register-indirect addressing in ThumbEE mode checks whether the base register is NULL; if it is, a callback is invoked (which could then throw NullPointerException). This is irrelevant in Python, because we don't use NULL as the value for "no object".
- fast array bounds check: there is an instruction that checks whether 0 <= Rm <= Rn, and invokes a callback if it's not; this would then throw ArrayOutOfBoundsException. This instruction would be emitted by the JIT just before any array access. In Python, you cannot easily JIT array access into a direct machine instruction (as you need to go through tp_as_sequence->sq_item); the array bounds check would likely disappear in white noise.
- fast switch instruction: there is an efficient way to switch between 256 different byte code operations, with an optional immediate parameter. It will call/jump to 256 byte code handlers. This allows for a straightforward JIT compiler which essentially compiles all byte codes into such switch instructions. That would work for Python as well, but would require that ceval be rewritten entirely.
- fast locals: efficient access to a local-variables array, for JIT generation of ldloc.i4 (in .NET, not sure what the Java byte code for local variables is). Would work as well for Python, assuming there is a JIT compiler in the first place. R9 holds the fastlocals pointer (which is good use of the register, since you cannot access it in Thumb mode, anyway).
- fast instance variables: likewise, with R10 holding the this pointer. Not applicable to Python, since there is no byte code for instance variable access.
- efficient array indexing: they give shift-and-index back to Thumb mode, for a shift by 2, allowing arrays with 4-byte elements to be indexed in a single instruction (rather than requiring a separate multiply-by-four). Again useful for JIT of array access instructions, not applicable to Python - although it would be nice if the C compiler knew how to emit that.

Regards, Martin
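The "fast switch" and "fast locals" items above describe a template-style JIT: each byte code is translated once into a call to one of up to 256 handlers, with an optional immediate, and locals live in a flat array. The following is only a software model of that idea under stated assumptions. The `Frame` class, the handler functions, the opcode numbers, and `translate`/`execute` are all hypothetical names introduced here; a real ThumbEE back-end would emit machine instructions, not Python tuples.

```python
class Frame:
    """Hypothetical frame: a flat locals array plus a value stack."""

    def __init__(self, nlocals):
        self.fastlocals = [0] * nlocals
        self.stack = []


def op_load_const(frame, value):
    frame.stack.append(value)


def op_load_fast(frame, index):
    frame.stack.append(frame.fastlocals[index])


def op_store_fast(frame, index):
    frame.fastlocals[index] = frame.stack.pop()


def op_add(frame, _unused):
    right = frame.stack.pop()
    left = frame.stack.pop()
    frame.stack.append(left + right)


# One slot per possible byte code value, as in the 256-handler scheme.
# The opcode numbers 0..3 are arbitrary choices for this sketch.
HANDLERS = [None] * 256
HANDLERS[0] = op_load_const
HANDLERS[1] = op_load_fast
HANDLERS[2] = op_store_fast
HANDLERS[3] = op_add


def translate(bytecode):
    """Translate (opcode, immediate) pairs into (handler, immediate) pairs.

    The opcode lookup happens once here, not on every execution.
    """
    translated = []
    for opcode, immediate in bytecode:
        handler = HANDLERS[opcode]
        if handler is None:
            raise ValueError("no handler for opcode %d" % opcode)
        translated.append((handler, immediate))
    return translated


def execute(translated, frame):
    """Run already-translated code: no decoding, just handler calls."""
    for handler, immediate in translated:
        handler(frame, immediate)


# local0 = 5; local1 = local0 + 7
bytecode = [(0, 5), (2, 0), (1, 0), (0, 7), (3, None), (2, 1)]
frame = Frame(nlocals=2)
execute(translate(bytecode), frame)
```

The design choice worth noting is the split between `translate` and `execute`: decoding is paid once per code object, which is where a dedicated handler-call instruction and a dedicated locals register would help.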
Martin v. Löwis <martin <at> v.loewis.de> writes:
- efficient array indexing: they give shift-and-index back to Thumb mode, for a shift by 2, allowing arrays with 4-byte elements to be indexed in a single instruction (rather than requiring a separate multiply-by-four). Again useful for JIT of array access instructions, not applicable to Python - although it would be nice if the C compiler knew how to emit that.
This could be used in PyTuple_GetItem and PyList_GetItem, no? (assuming Thumb has 4-byte pointers)
Antoine Pitrou <solipsis@pitrou.net> wrote:
This could be used in PyTuple_GetItem and PyList_GetItem, no?
Yes, but that's a burden for the compiler (a Thumb-EE specific back-end). Otherwise, we can introduce Thumb-EE assembly code where needed, but the same can happen for a wide range of ISAs. At least IA32 and AMD64 have specific addressing modes where it's possible to use a multiplying factor of 1, 2, 4 or 8 for the index register. I hope that compilers are smart enough to already use them.
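The arithmetic behind scaled indexing is small enough to spell out. This sketch models memory as a `bytes` object and computes the element address as base plus index shifted left by 2, which is the multiply-by-four that a shift-and-index addressing mode folds into the load itself. The function names, the little-endian layout, and the sample values are assumptions made for illustration only.

```python
import struct

POINTER_SIZE = 4   # bytes per element, as with 4-byte pointers
SHIFT = 2          # 1 << 2 == 4, so a shift by 2 scales the index by 4


def scaled_offset(index, scale):
    """Offset of element `index` for one of the scale factors named above."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("unsupported scale factor: %r" % (scale,))
    return index * scale


def load_word(memory, base, index):
    """Load the 4-byte element at base + (index << 2)."""
    offset = base + (index << SHIFT)
    (value,) = struct.unpack_from("<I", memory, offset)
    return value


# Eight bytes of padding, then a table of four 4-byte entries at offset 8.
memory = bytes(8) + struct.pack("<4I", 0x1000, 0x2000, 0x3000, 0x4000)
third = load_word(memory, base=8, index=2)
```

Without such an addressing mode, the shift and the add are separate instructions before the load; with it, all three steps are one instruction.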
(assuming Thumb has 4-byte pointers)
Yes, it does. Cesare
At least IA32 and AMD64 have specific addressing modes where it's possible to use a multiplying factor of 1, 2, 4 or 8 for the index register.
I hope that compilers are smart enough to already use them.
For x86, certainly (at least GCC does). For Thumb, certainly not: the compiler cannot assume that the code is in ThumbEE mode. Regards, Martin
Antoine Pitrou wrote:
This could be used in PyTuple_GetItem and PyList_GetItem, no? (assuming Thumb has 4-byte pointers)
Yes - but it would require an assembly version of these functions; I'm skeptical that the savings would be measurable (given that there is also the type check and the range check). OTOH, PyTuple_GET_ITEM could probably be implemented as inline assembly. Regards, Martin
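The point about the type check and the range check can be modelled as follows. This is a rough Python analogy, not the C source: the two functions stand in for a checked accessor in the style of PyTuple_GetItem and an unchecked one in the style of PyTuple_GET_ITEM, and the exception types were picked for illustration rather than to match the C API's error reporting.

```python
def tuple_get_item_checked(obj, index):
    """Model of a checked accessor: type check, range check, then the load."""
    if not isinstance(obj, tuple):
        raise TypeError("expected a tuple")
    if index < 0 or index >= len(obj):
        raise IndexError("tuple index out of range")
    return obj[index]


def tuple_get_item_unchecked(obj, index):
    """Model of an unchecked accessor: only the indexed load remains.

    In C the caller must guarantee the type and the range. Python's own
    indexing still checks here, so this shows structure, not speed.
    """
    return obj[index]


items = (10, 20, 30)
checked = tuple_get_item_checked(items, 1)
unchecked = tuple_get_item_unchecked(items, 1)
```

In the checked version the indexed load is one step out of three, which is why saving an instruction on the load alone may be hard to measure.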
participants (7)
- "Martin v. Löwis"
- Antoine Pitrou
- Benjamin M. Schwartz
- Benjamin Schwartz
- Brett Cannon
- Cesare Di Mauro
- Nick Coghlan