# [pypy-svn] r77860 - in pypy/extradoc/talk/pepm2011: . figures

antocuni at codespeak.net
Wed Oct 13 13:45:16 CEST 2010

Author: antocuni
Date: Wed Oct 13 13:45:15 2010
New Revision: 77860

Modified:
Log:
fix some typos, add few XXXs, improve sentences here and there

==============================================================================
Binary files. No diff available.

==============================================================================
Binary files. No diff available.

==============================================================================
Binary files. No diff available.

==============================================================================
+++ pypy/extradoc/talk/pepm2011/paper.bib	Wed Oct 13 13:45:15 2010
@@ -240,6 +240,7 @@
pages = {18--25}
},

+
@techreport{mason_chang_efficient_2007,
title = {Efficient {Just-In-Time} Execution of Dynamically Typed Languages
Via Code Specialization Using Precise Runtime Type Inference},
@@ -316,3 +317,13 @@
year = {2008},
pages = {123--139}
}
+
+
+@phdthesis{cuni_python_cli_2010,
+	author = {Antonio Cuni},
+	title = {High performance implementation of {Python} for {CLI/.NET} with
+                  {JIT} compiler generation for dynamic languages.},
+	school = {Dipartimento di Informatica e Scienze dell'Informazione, University of Genova},
+	note = {Technical Report {DISI-TH-2010-05}},
+	year = {2010},
+},

==============================================================================
+++ pypy/extradoc/talk/pepm2011/paper.tex	Wed Oct 13 13:45:15 2010
@@ -28,6 +28,7 @@
\newcommand\arigo[1]{\nb{AR}{#1}}
\newcommand\fijal[1]{\nb{FIJAL}{#1}}
\newcommand\david[1]{\nb{DAVID}{#1}}
+\newcommand\anto[1]{\nb{ANTO}{#1}}
\newcommand\reva[1]{\nb{Reviewer 1}{#1}}
\newcommand\revb[1]{\nb{Reviewer 2}{#1}}
\newcommand\revc[1]{\nb{Reviewer 3}{#1}}
@@ -130,13 +131,13 @@

A recently popular approach to implementing just-in-time compilers for dynamic
languages is that of a tracing JIT. A tracing JIT works by observing the running
-program and recording linear execution traces, which are then turned into
+program and recording its hot parts into linear execution traces, which are then turned into
machine code. One reason for the popularity of tracing JITs is their relative
simplicity. They can often be added to an interpreter and a lot of the
infrastructure of an interpreter can be reused. They give some important
-produces linear pieces of code, which makes many optimizations that are usually
-hard in a compiler simpler, such as register allocation.
+produces linear pieces of code, which simplifies many optimizations that are usually
+hard in a compiler, such as register allocation.

The usage of a tracing JIT can remove the overhead of bytecode dispatch and that
of the interpreter data structures. In this paper we want to present an approach
@@ -198,7 +199,7 @@
\cite{carl_friedrich_bolz_back_2008} and a GameBoy emulator
\cite{bruni_pygirl:_2009}.

-The feature that makes PyPy more than a compiler with a runtime system is it's
+The feature that makes PyPy more than a compiler with a runtime system is its
support for automated JIT compiler generation \cite{bolz_tracing_2009}. During
the translation to C, PyPy's tools can generate a just-in-time compiler for the
language that the interpreter is implementing. This process is mostly
@@ -206,7 +207,13 @@
source-code hints. Mostly-automatically generating a JIT compiler has many advantages
over writing one manually, which is an error-prone and tedious process.
By construction, the generated JIT has the same semantics as the interpreter.
-Many optimizations can benefit all languages implemented as an interpreter in RPython.
+Many optimizations can benefit all languages implemented as an interpreter in RPython.
+
+Moreover, thanks to the internal design of the JIT generator, it is very easy
+to add new \emph{backends} for producing the actual machine code, in addition
+to the original backend for the Intel \emph{x86} architecture.  Examples of
+additional JIT backends are the one for Intel \emph{x86-64} and an
+experimental one for the CLI .NET Virtual Machine \cite{cuni_python_cli_2010}.
The JIT that is produced by PyPy's JIT generator is a \emph{tracing JIT
compiler}, a concept which we now explain in more details.

@@ -233,7 +240,7 @@
interpreter records all operations that it is executing while running one
iteration of the hot loop. This history of executed operations of one loop is
called a \emph{trace}. Because the trace corresponds to one iteration of a loop,
-it always ends with a jump to its own beginning. The trace also contains all
+it always ends with a jump to its own beginning \anto{this is not true: what if we trace the last iteration?}. The trace also contains all
operations that are performed in functions that were called in the loop, thus a
tracing JIT automatically performs inlining.

@@ -264,7 +271,9 @@
If one specific guard fails often enough, the tracing JIT will generate a new
trace that starts exactly at the position of the failing guard. The existing
assembler is patched to jump to the new trace when the guard fails
-\cite{andreas_gal_incremental_2006}.
+\cite{andreas_gal_incremental_2006}.  This approach guarantees that all the
+hot paths in the program will eventually be traced and compiled into efficient
+code.

\subsection{Running Example}

@@ -356,13 +365,14 @@
call to \texttt{add}. These method calls need to check the type of the involved
objects repeatedly and redundantly. In addition, a lot of objects are created
when executing that loop, many of these objects do not survive for very long.
-The actual computation that is performed by \texttt{f} is simply a number of
+The actual computation that is performed by \texttt{f} is simply a sequence of

\begin{figure}
\texttt{
\begin{tabular}{l}
+\# XXX: maybe we should specify that $p_{0}$, $p_{1}$ corresponds to y and res
\# arguments to the trace: $p_{0}$, $p_{1}$ \\
guard\_class($p_{1}$, BoxedInteger) \\
@@ -533,7 +543,8 @@
\texttt{get} from such an object, the result is read from the shape
description, and the operation is also removed. Equivalently, a
\texttt{guard\_class} on a variable that has a shape description can be removed
-as well, because the shape description stores the type.
+as well, because the shape description stores the type and thus the result of
+the guard is statically known.

In the example from last section, the following operations would produce two
static objects, and be completely removed from the optimized trace:
@@ -551,6 +562,8 @@
\texttt{BoxedInteger} whose \texttt{intval} field contains $i_{4}$; the
one associated with $p_{6}$ would know that it is a \texttt{BoxedInteger}
whose \texttt{intval} field contains the constant -100.
+\anto{this works only because we use SSI and thus the value of $i_{4}$ never
+changes. However, SSI is not explained anywhere in the paper}

The following operations on $p_{5}$ and $p_{6}$ could then be
optimized using that knowledge:
@@ -595,7 +608,7 @@
jump is emitted, they are \emph{lifted}. This means that the optimizer produces code
that allocates a new object of the right type and sets its fields to the field
values that the static object has (if the static object points to other static
-objects, those need to be lifted as well) This means that instead of the jump,
+objects, those need to be lifted as well, recursively). This means that instead of the jump,
the following operations are emitted:

\texttt{
@@ -613,7 +626,9 @@
the objects are still allocated at the end. However, the optimization was still
worthwhile even in this case, because some operations that have been performed
on the lifted static objects have been removed (some \texttt{get} operations
-and \texttt{guard\_class} operations).
+and \texttt{guard\_class} operations).  Moreover, in real-life examples the
+loops are usually more complex and contain more objects of type 1, so this
+technique is more effective.

\begin{figure}
\includegraphics{figures/step1.pdf}
@@ -797,7 +812,7 @@
the variables, and the operation has to be residualized.

If the argument $v$ of a \texttt{get} operation is mapped to something in the static
-heap, the get can be performed at optimization time. Otherwise, the \texttt{get}
+heap, the \texttt{get} can be performed at optimization time. Otherwise, the \texttt{get}
operation needs to be residualized.

If the first argument $v$ to a \texttt{set} operation is mapped to something in the
@@ -919,14 +934,14 @@
\item \textbf{raytrace-simple}: A ray tracer.
\item \textbf{richards}: The Richards benchmark.
\item \textbf{spambayes}: A Bayesian spam filter\footnote{\texttt{http://spambayes.sourceforge.net/}}.
-    \item \textbf{telco}: A Python version of the Telco decimal.
+    \item \textbf{telco}: A Python version of the Telco decimal
benchmark\footnote{\texttt{http://speleotrove.com/decimal/telco.html}},
using a pure Python decimal floating point implementation.
\item \textbf{twisted\_names}: A DNS server benchmark using the Twisted networking
framework\footnote{\texttt{http://twistedmatrix.com/}}.
\end{itemize}

-We evaluate the allocation removal algorithm along two lines: First we want to
+We evaluate the allocation removal algorithm along two lines: first we want to
know how many allocations could be optimized away. On the other hand, we want
to know how much the run times of the benchmarks is improved.

@@ -935,6 +950,7 @@
seen in Figure~\ref{fig:numops}. The optimization removes as many as XXX and as
little as XXX percent of allocation operations in the benchmarks. All benchmarks
taken together, the optimization removes XXX percent of allocation operations.
+\anto{Actually, we can only know how many operations we removed from the traces, not from the actual execution}

\begin{figure*}
\begin{tabular}{lrrrrrrrrrrrrrrrrrrrrrrr}
@@ -966,7 +982,7 @@
CPython\footnote{\texttt{http://python.org}}, which uses a bytecode-based
interpreter. Furthermore we compared against Psyco \cite{rigo_representation-based_2004}, an extension to
CPython which is a just-in-time compiler that produces machine code at run-time.
-It is not based on traces. Of PyPy's Python interpreter we used three versions,
+It is not based on traces. Finally, we used three versions of PyPy's Python interpreter:
one without a JIT, one including the JIT but not using the allocation removal
optimization, and one using the allocation removal optimizations.