arigo at codespeak.net
Tue May 29 15:31:42 CEST 2007

Author: arigo
Date: Tue May 29 15:31:41 2007
New Revision: 43848

Added:
   pypy/extradoc/talk/dls2007/paper.bib
      - copied, changed from r43833, pypy/extradoc/talk/dyla2007/dyla.bib
Modified:
   pypy/extradoc/talk/dls2007/Makefile
   pypy/extradoc/talk/dls2007/paper.tex
Log:
Finish LaTeXification, complete things a bit.

==============================================================================
+++ pypy/extradoc/talk/dls2007/Makefile	Tue May 29 15:31:41 2007
@@ -1,7 +1,7 @@

-paper.pdf: paper.tex #paper.bib image/*.pdf
-	#pdflatex paper
-	#bibtex paper
+paper.pdf: paper.tex paper.bib
+	pdflatex paper
+	bibtex paper
 	pdflatex paper
 	pdflatex paper

==============================================================================
+++ pypy/extradoc/talk/dls2007/paper.tex	Tue May 29 15:31:41 2007
@@ -52,21 +52,36 @@
experimented with \cite{REJIT}, but this is clearly an area in need of
research and innovative approaches.

-One of the central goals of the PyPy project is to automatically
+One of the central goals of the PyPy project \cite{PyPy} is to automatically
produce dynamic compilers from an interpreter, with as few
modifications to the interpreter code base itself as possible.

-The forest of flow graphs that the translation process \cite{VMCDLS}
-generates and transforms constitutes a reasonable base for the
-necessary analyses.  That's a further reason why having a high-level
-runnable and analyzable interpreter implementation was always a
-central tenet of the project: in our approach,
-the dynamic compiler is just another aspect
-transparently introduced by and during the translation
-process.
+PyPy contains a complete interpreter for the Python language, written in
+a high-level language, RPython, which is a subset of Python amenable to
+static analysis.  It also contains a translation toolchain for compiling
+this interpreter to either C (or C-like) environments, or to the higher
+level environments provided by general-purpose virtual machines like
+Java's and .NET.  The translation toolchain accepts any RPython
+program as input, although our focus is on translating RPython
+programs that are interpreters for dynamic languages.\footnote{We also have an
+interpreter for Prolog and the beginning of one for JavaScript.}
+
+The translation framework uses control flow graphs in SSI format as its
+intermediate representation (SSI is a stricter subset of SSA).  The
+details of this process are beyond the scope of the present paper, and
+have been presented in \cite{pypyvmconstruction}.
+The present paper describes a
+special optional transformation that we integrated with this translation
+framework: deriving a dynamic compiler from the interpreter.  In other
+words, our translation framework takes as input an interpreter for any
+language (it works best for dynamic languages); as long as the
+interpreter is written in RPython and contains a small number of extra
+hints, the framework can produce from it a complete virtual machine
+\emph{that contains a just-in-time compiler for the dynamic language.}

Partial evaluation techniques should, at least theoretically,
-allow such a derivation of a compiler from an interpreter [PE], but it
+allow such a derivation of a compiler from an interpreter
+\cite{partial-evaluation}, but it
is not reasonable to expect the code produced for an input program by
a compiler derived using partial evaluation to be very good,
especially in the case of a dynamic language.  Essentially, the input
@@ -83,9 +98,9 @@
This allows the compiler to generate code optimized for the
actual run-time behaviour of the program.

-Inspired by Psyco \cite{PSYCO}, which is a hand-written dynamic compiler
+Inspired by Psyco \cite{psyco-paper}, which is a hand-written dynamic compiler
based on partial evaluation for Python, we developed a technique --
-*promotion* - for our dynamic compiler generator. Simply put, promotion
+\emph{promotion} -- for our dynamic compiler generator. Simply put, promotion
on a value stops compilation and waits until the run-time reaches this
point.  When it does, the actual run-time value is promoted into a
compile-time constant, and compilation resumes with this extra
@@ -102,98 +117,113 @@

\subsection{Overview of partial evaluation}

-Partial evaluation is the process of evaluating a function, say f(x,
-y), with only partial information about the values of its arguments,
-say the value of the x argument only.  This produces a *residual*
-function g(y), which takes less arguments than the original -- only
+\def\code#1{\texttt{#1}}
+
+Partial evaluation is the process of evaluating a function, say \code{f(x,
+y)}, with only partial information about the values of its arguments,
+say the value of the \code{x} argument only.  This produces a \emph{residual}
+function \code{g(y)}, which takes fewer arguments than the original -- only
the information not specified during the partial evaluation process needs
-to be provided to the residual function, in this example the y
+to be provided to the residual function, in this example the \code{y}
argument.
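As a minimal sketch (a toy specializer, not part of PyPy), partially evaluating \code{f(x, y) = x * y + x} with \code{x} known can be done by folding the known value into freshly generated source for the residual function; the names here are invented for illustration:

```python
def f(x, y):
    return x * y + x

def specialize_f(x):
    # Build the residual function's source with the known x folded in
    # as a literal, then compile it.
    src = "def g(y):\n    return %d * y + %d\n" % (x, x)
    namespace = {}
    exec(src, namespace)
    return namespace["g"]
```

With \code{x = 5}, \code{specialize\_f} returns a residual \code{g(y)} computing \code{5 * y + 5}, which only needs the argument left unspecified.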

Partial evaluation (PE) comes in two flavors:
+%
+\begin{enumerate}

-* *On-line* PE: a compiler-like algorithm takes the source code of the
-  function f(x, y) (or its intermediate representation, i.e. its
+\item\emph{On-line PE:} a compiler-like algorithm takes the source code of the
+  function \code{f(x, y)} (or its intermediate representation, i.e.\ its
control flow graph in PyPy's terminology), and some partial
-  information, e.g. x=5.  From this, it produces the residual
-  function g(y) directly, by following in which operations the
-  knowledge x=5 can be used, which loops can be unrolled, etc.
+  information, e.g.\ \code{x=5}.  From this, it produces the residual
+  function \code{g(y)} directly, by following in which operations the
+  knowledge \code{x=5} can be used, which loops can be unrolled, etc.

-* *Off-line* PE: in many cases, the goal of partial evaluation is to
+\item\emph{Off-line PE:} in many cases, the goal of partial evaluation is to
improve performance in a specific application.  Assume that we have a
-  single known function f(x, y) in which we think that the value of
-  x will change slowly during the execution of our program -- much
-  more slowly than the value of y.  An obvious example is a loop
-  that calls f(x, y) many times with always the same value x.
-  We could then use an on-line partial evaluator to produce a g(y)
-  for each new value of x.  In practice, the overhead of the partial
+  single known function \code{f(x, y)} in which we think that the value of
+  \code{x} will change slowly during the execution of our program -- much
+  more slowly than the value of \code{y}.  An obvious example is a loop
+  that calls \code{f(x, y)} many times with always the same value \code{x}.
+  We could then use an on-line partial evaluator to produce a \code{g(y)}
+  for each new value of \code{x}.  In practice, the overhead of the partial
evaluator might be too large for it to be executed at run-time.
-  However, if we know the function f in advance, and if we know
-  *which* arguments are the ones that we will want to partially evaluate
-  f with, then we do not need a full compiler-like analysis of f
-  every time the value of x changes.  We can precompute once and for
-  all a specialized function f1(x), which when called produces the
-  residual function g(y) corresponding to x.  This is *off-line
-  partial evaluation;* the specialized function f1(x) is called a
-  *generating extension*.
+  However, if we know the function \code{f} in advance, and if we know
+  \emph{which} arguments are the ones that we will want to partially evaluate
+  \code{f} with, then we do not need a full compiler-like analysis of \code{f}
+  every time the value of \code{x} changes.  We can precompute once and for
+  all a specialized function \code{f1(x)}, which when called produces the
+  residual function \code{g(y)} corresponding to \code{x}.  This is
+  \emph{off-line partial evaluation;} the specialized function \code{f1(x)}
+  is called a \emph{generating extension.}
+
+\end{enumerate}
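The distinction can be sketched in Python with a toy \code{f} whose first argument is a tiny "program" (a list of opcodes), anticipating the interpreter setting below; the generating extension \code{f1} is written once, ahead of time, and performs no compiler-like analysis when invoked (all names here are invented for illustration):

```python
def f(x, y):
    # Function to specialize: x is the slowly-changing "program",
    # y the quickly-changing input data.
    total = 0
    for op in x:
        if op == "inc":
            total += y
        elif op == "double":
            total *= 2
    return total

def f1(x):
    # Generating extension for f: given x, it directly emits the
    # residual g(y), with the dispatch loop unrolled away.
    lines = ["def g(y):", "    total = 0"]
    for op in x:
        if op == "inc":
            lines.append("    total += y")
        elif op == "double":
            lines.append("    total *= 2")
    lines.append("    return total")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["g"]
```

For a fixed \code{x}, \code{f1(x)} is computed once and the cheap residual \code{g(y)} is then called for each new \code{y}.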

The PyPy JIT generation framework is based on off-line partial
-evaluation.  The function called f(x, y) above is typically the main
+evaluation.  The function called \code{f(x, y)} above is typically the main
loop of some interpreter written in RPython.  The size of the interpreter can range
from a three-liner used for testing purposes to the whole of PyPy's
-Python interpreter.  In all cases, x stands for the input program
-(the bytecode to interpret) and y stands for the input data (like a
+Python interpreter.  In all cases, \code{x} stands for the input program
+(the bytecode to interpret) and \code{y} stands for the input data (like a
frame object with the binding of the input arguments and local
variables).  Our framework is capable of automatically producing the
-corresponding generating extension f1(x), which takes an input
-program only and produces a residual function g(y).  This f1(x)
+corresponding generating extension \code{f1(x)}, which takes an input
+program only and produces a residual function \code{g(y)}.  This \code{f1(x)}
is a compiler\footnote{
What we get in PyPy is more precisely a \emph{just-in-time compiler:}
if promotion is used, compiling ahead of time is not possible.
}
-for the very same language for which f(x, y) is
+for the very same language for which \code{f(x, y)} is
an interpreter.

-Off-line partial evaluation is based on *binding-time analysis,* which
+Off-line partial evaluation is based on \emph{binding-time analysis,} which
is the process of determining among the variables used in a function (or
a set of functions) which ones are going to be known in advance and
-which ones are not.  In the example of f(x, y), such an analysis
-would be able to infer that the constantness of the argument x
+which ones are not.  In the example of \code{f(x, y)}, such an analysis
+would be able to infer that the constantness of the argument \code{x}
implies the constantness of many intermediate values used in the
-function.  The *binding time* of a variable determines how early the
+function.  The \emph{binding time} of a variable determines how early the
value of the variable will be known.

Once binding times have been determined, one possible approach to
producing the generating extension itself is by self-applying on-line
partial evaluators.  This is known as the second Futamura projection
-\cite{FU}.  So far it is unclear if this approach can lead to optimal
+\cite{Futamura}.  So far it is unclear if this approach can lead to optimal
results, or even if it scales well.  In PyPy we selected a more direct
approach: the generating extension is produced by transformation of the
control flow graphs of the interpreter, guided by the binding times.  We
-call this process *timeshifting*.
+call this process \emph{timeshifting.}
+
+
+\subsection{Related work}
+
+XXX PE; Psyco; REJIT; ?

\section{Architecture and Principles}

PyPy contains a framework for generating just-in-time compilers using
off-line partial evaluation.  As such, there are three distinct phases:
+%
+\begin{enumerate}

-* *Translation time:* during the normal translation of an RPython
+\item\emph{Translation time:} during the normal translation of an RPython
program, say PyPy's Python interpreter, we perform binding-time
analysis and off-line specialization ("timeshifting") of the
interpreter.  This produces a generating extension, which is linked
with the rest of the program.

-* *Compile time:* during the execution of the program, when a new
+\item\emph{Compile time:} during the execution of the program, when a new
bytecode is about to be interpreted, the generating extension is
invoked instead.  As the generating extension is a compiler, all the
computations it performs are called compile-time computations.  Its
sole effect is to produce residual code.

-* *Run time:* the normal execution of the program (which includes the
+\item\emph{Run time:} the normal execution of the program (which includes the
time spent running the residual code created by the generating
extension).

+\end{enumerate}
+
Translation time is a purely off-line phase; compile time and run time
are actually highly interleaved during the execution of the program.

@@ -202,33 +232,37 @@
\label{bta}

At translation time, PyPy performs binding-time analysis of the source
-RPython program after it has been turned to low-level graphs, i.e. at
+RPython program after it has been turned to low-level graphs, i.e.\ at
the level at which operations manipulate pointer-and-structure-like
objects.

The binding-time terminology that we are using in PyPy is based on the
colors that we use when displaying the control flow graphs:
-
-* *Green* variables contain values that are known at compile-time;
-* *Red* variables contain values that are not known until run-time.
+%
+\begin{itemize}
+\item\emph{Green} variables contain values that are known at compile-time;
+\item\emph{Red} variables contain values that are not known until run-time.
+\end{itemize}

The binding-time analyzer of our translation tool-chain is based on the
same type inference engine that is used on the source RPython program,
-the annotator.  In this mode, it is called the *hint-annotator*; it
+the annotator.  In this mode, it is called the \emph{hint-annotator;} it
operates on the low-level graphs rather than at the
RPython level, and propagates annotations that do not track types but
value dependencies and manually-provided binding time hints.

The normal process of the hint-annotator is to propagate the binding
-time (i.e. color) of the variables using the following kind of rules:
+time (i.e.\ color) of the variables using the following kind of rules:
+%
+\begin{itemize}

-* For a foldable operation (i.e. one without side effect and which
+\item For a foldable operation (i.e.\ one without side effect and which
depends only on its argument values), if all arguments are green,
then the result can be green too.

-* Non-foldable operations always produce a red result.
+\item Non-foldable operations always produce a red result.

-* At join points, where multiple possible values (depending on control
+\item At join points, where multiple possible values (depending on control
flow) are meeting into a fresh variable, if any incoming value comes
from a red variable, the result is red.  Otherwise, the color of the
result might be green.  We do not make it eagerly green, because of
@@ -238,6 +272,8 @@
fresh join variable thus depends on which branches are taken in the
residual graph.

+\end{itemize}
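The first two rules above can be sketched as a small fixpoint propagation over a list of operations; the representation is invented here for illustration and is far simpler than the real hint-annotator (in particular, join points and the eager-green subtlety are ignored):

```python
GREEN, RED = "green", "red"

def propagate_colors(ops, colors):
    # ops: list of (result_var, foldable, argument_vars).
    # colors: initial binding times of the input variables.
    changed = True
    while changed:          # iterate to a fixpoint
        changed = False
        for result, foldable, args in ops:
            if foldable and all(colors.get(a) == GREEN for a in args):
                new = GREEN     # all operands known at compile-time
            else:
                new = RED       # side effects, or a red operand
            if colors.get(result) != new:
                colors[result] = new
                changed = True
    return colors
```

A green result feeding a later foldable operation keeps that operation green too, which is how constantness propagates forward through the graphs.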
+
\subsubsection*{Hints}

Our goal in designing our approach to binding-time analysis was to
@@ -248,24 +284,25 @@
The driving idea was that hints should be need-oriented.  Indeed, in a
program like an interpreter, there are a small number of places where it
would be clearly beneficial for a given value to be known at
-compile-time, i.e. green: this is where we require the hints to be
+compile-time, i.e.\ green: this is where we require the hints to be placed.

The hint-annotator assumes that all variables are red by default, and
then propagates annotations that record dependency information.
When encountering the user-provided hints, the dependency information
is used to make some variables green.  All
-hints are in the form of an operation hint(v1, someflag=True)
+hints are in the form of an operation \code{hint(v1, someflag=True)}
which semantically just returns its first argument unmodified.
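Operationally, \code{hint} can be sketched as a plain identity function; its keyword flags carry meaning only for the binding-time analysis, which reads them off the flow graphs at translation time (this mirrors the description above, not PyPy's actual source):

```python
def hint(v, **flags):
    # Semantically transparent: returns its first argument unmodified.
    # Flags like concrete=True or promote=True are directives for the
    # hint-annotator, not run-time behaviour.
    return v

def next_opcode(bytecode, pc):
    # Typical use inside an interpreter: request that the fetched
    # opcode be a compile-time (green) value.
    opcode = bytecode[pc]
    return hint(opcode, concrete=True)
```

Because \code{hint} is the identity, the interpreter runs unchanged when no specialization is performed.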

-The crucial need-oriented hint is v2 = hint(v1, concrete=True)
+The crucial need-oriented hint is
+$$\code{v2 = hint(v1, concrete=True)}$$
which should be used in places where the programmer considers the
knowledge of the value to be essential.  This hint is interpreted by
-the hint-annotator as a request for both v1 and v2 to be green.  It
-has a *global* effect on the binding times: it means that not only
-v1 but all the values that v1 depends on -- recursively --
+the hint-annotator as a request for both \code{v1} and \code{v2} to be green.  It
+has a \emph{global} effect on the binding times: it means that not only
+\code{v1} but all the values that \code{v1} depends on -- recursively --
are forced to be green.  The hint-annotator complains if the
-dependencies of v1 include a value that cannot be green, like
+dependencies of \code{v1} include a value that cannot be green, like
a value read out of a field of a non-immutable structure.

Such a need-oriented backward propagation has advantages over the
@@ -276,22 +313,23 @@
of the residual code), or less variables than expected (preventing
specialization to occur where it would be the most useful).  Our
need-oriented approach reduces the problem of over-specialization, and
-it prevents under-specialization: an unsatisfiable hint(v1,
-concrete=True) is reported as an error.
+it prevents under-specialization: an unsatisfiable \code{hint(v1,
+concrete=True)} is reported as an error.

In our context, though, such an error can be corrected.  This is done by
-promoting a well-chosen variable among the ones that v1 depends on.
+promoting a well-chosen variable among the ones that \code{v1} depends on.

Promotion is invoked with the use of a hint as well:
-v2 = hint(v1, promote=True).
-This hint is a *local* request for v2 to be green, without
-requiring v1 to be green.  Note that this amounts to copying
+\code{v2 = hint(v1, promote=True)}.
+This hint is a \emph{local} request for \code{v2} to be green, without
+requiring \code{v1} to be green.  Note that this amounts to copying
a red value into a green one, which is not possible in classical
approaches to partial evaluation.  See section \ref{promotion} for a
complete discussion of promotion.

For examples and further discussion on how the hints are applied in practice
-see Make your own JIT compiler \cite{D08.1}.
+see \emph{Make your own JIT compiler} at
+\code{http://codespeak.net/pypy/dist/pypy/doc/jit.html}. % XXX check url

\subsection{Timeshifting}

@@ -307,7 +345,7 @@
cannot be expressed as low-level flow graphs).
}
accordingly in order to produce a generating extension.  We call
-this process *timeshifting* because it changes the time at
+this process \emph{timeshifting} because it changes the time at
which the graphs are meant to be run, from run-time to compile-time.

Despite the execution time and side-effects shift to produce only
@@ -330,32 +368,40 @@
The basic idea of timeshifting is to transform operations in a way that
depends on the color of their operands and result.  Variables themselves
need to be represented based on their color:
+%
+\begin{itemize}

-* The red (run-time) variables have abstract values at compile-time;
+\item The red (run-time) variables have abstract values at compile-time;
no actual value is available for them during compile-time. For them
we use a boxed representation that can carry either a run-time storage
location (a stack frame position or a register name) or an immediate
constant (for when the value is, after all, known at compile-time).

-* On the other hand, the green variables are the ones that can carry
+\item On the other hand, the green variables are the ones that can carry
their value already at compile-time, so they are left untouched during
timeshifting.

+\end{itemize}
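The boxed representation of red variables can be sketched as follows, with invented names (the real implementation is richer): each box carries either a run-time storage location or an immediate constant.

```python
class RedBox:
    # Compile-time stand-in for a red (run-time) variable: it holds
    # either a run-time storage location (register name or stack frame
    # position) or an immediate constant.
    def __init__(self, location=None, constant=None):
        self.location = location
        self.constant = constant

    def is_constant(self):
        # True when the value turned out to be known at compile-time
        # after all.
        return self.location is None
```

Green variables need no such box: their values are ordinary compile-time values.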
+
The operations of the original graphs are then transformed as follows:
+%
+\begin{itemize}

-* If an operation has no side effect nor any other run-time dependency, and
+\item If an operation has no side effect nor any other run-time dependency, and
if it only involves green operands, then it can stay unmodified in the
graph.  In this case, the operation that was run-time in the original
graph becomes a compile-time operation, and it will never be generated
in the residual code.  (This is the case that makes the whole approach
worthwhile: some operations become purely compile-time.)

-* In all other cases, the operation might have to be generated in the
+\item In all other cases, the operation might have to be generated in the
residual code.  In the timeshifted graph it is replaced by a call
to a helper which will generate a residual operation manipulating
the input run-time values and return a new boxed representation
for the run-time result location.

+\end{itemize}
+
These helpers will constant-fold the operation if the inputs
are immediate constants and if the operation has no side-effects.
Immediate constants can occur even though the
corresponding variable in the graph was red: a variable can be
@@ -363,7 +409,7 @@
point in (compile)-time, independently of the hint-annotator
proving that it is always the case.
In Partial Evaluation terminology, the timeshifted graphs are
-performing some *on-line* partial evaluation in addition to the
+performing some \emph{on-line} partial evaluation in addition to the
off-line job enabled by the hint-annotator.
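Such a helper can be sketched for integer addition, using an invented tuple encoding of boxes -- \code{("const", n)} for an immediate constant, \code{("loc", name)} for a run-time location; this is an illustration of the idea, not PyPy's actual helper:

```python
def timeshifted_add(box1, box2, residual_code, fresh_loc):
    # Boxes: ("const", value) or ("loc", location_name).
    if box1[0] == "const" and box2[0] == "const":
        # Both inputs known at this point in compile-time:
        # constant-fold, emit no residual operation.
        return ("const", box1[1] + box2[1])
    # Otherwise emit a residual add manipulating the run-time
    # locations, and return a box for the fresh result location.
    result = fresh_loc()
    residual_code.append(("add", result, box1, box2))
    return ("loc", result)
```

When both operands happen to be constants, the addition disappears from the residual code entirely; otherwise exactly one residual instruction is appended.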

\subsubsection*{Merges and Splits}
@@ -373,16 +419,16 @@
This state is used to shape the control flow of the generated residual
code, as follows.

-After a *split,* i.e. after a conditional branch that could not be
+After a \emph{split,} i.e.\ after a conditional branch that could not be
folded at compile-time, the compilation state is duplicated and both
-branches are compiled independently.  Conversely, after a *merge point,*
-i.e. when two control flow paths meet each other, we try to join the two
+branches are compiled independently.  Conversely, after a \emph{merge point,}
+i.e.\ when two control flow paths meet each other, we try to join the two
paths in the residual code.  This part is more difficult because the two
-paths may need to be compiled with different variable bindings -- e.g.
-different variables may be known to take different compile-time constant
+paths may need to be compiled with different variable bindings --
+e.g.\ different variables may be known to take different compile-time constant
values in the two branches.  The two paths can either be kept separate
or merged; in the latter case, the merged compilation-time state needs
-to be a generalization (*widening*) of the two already-seen states.
+to be a generalization \emph{(widening)} of the two already-seen states.
Deciding when to do each is a classical problem of partial evaluation,
as merging too eagerly may lose important precision and not merging
eagerly enough may create too many redundant residual code paths (to the
@@ -414,7 +460,7 @@
\label{promotion}

In the sequel, we describe in more details one of the main new
-techniques introduced in our approach, which we call *promotion*.  In
+techniques introduced in our approach, which we call \emph{promotion.}  In
short, it allows an arbitrary run-time value to be turned into a
compile-time value at any point in time.  Each promotion point is
explicitly defined with a hint that must be put in the source code of
@@ -425,7 +471,7 @@
copying a variable whose binding time is compile-time into a variable
whose binding time is run-time -- it corresponds to the compiler
"forgetting" a particular value that it knew about.  By contrast,
-promotion is a way for the compiler to gain *more* information about
+promotion is a way for the compiler to gain \emph{more} information about
the run-time execution of a program. Clearly, this requires
fine-grained feedback from run-time to compile-time, thus a
dynamic setting.
@@ -457,7 +503,8 @@
techniques are crucial for good results.  The main goal is to
optimize and reduce the overhead of dynamic dispatching and indirect
invocation.  This is achieved with variations on the technique of
-polymorphic inline caches \cite{PIC}: the dynamic lookups are cached and
+polymorphic inline caches \cite{polymorphic-inline-caches}:
+the dynamic lookups are cached and
the corresponding generated machine code contains chains of
compare-and-jump instructions which are modified at run-time.  These
techniques also allow the gathering of information to direct inlining for even
@@ -472,7 +519,7 @@
promoted to compile-time.  As we will see in the sequel, this produces
very similar machine code.\footnote{
This can also be seen as a generalization of a partial
-    evaluation transformation called "The Trick" (see e.g. \cite{PE}),
+    evaluation transformation called "The Trick" (see e.g.\ \cite{partial-evaluation}),
which again produces similar code but which is only
applicable for finite sets of values.
}
@@ -486,7 +533,7 @@
\subsubsection*{Promotion in practice}

The implementation of promotion requires a tight coupling between
-compile-time and run-time: a *callback,* put in the generated code,
+compile-time and run-time: a \emph{callback,} put in the generated code,
which can invoke the compiler again.  When the callback is actually
reached at run-time, and only then, the compiler resumes and uses the
knowledge of the actual run-time value to generate more code.
@@ -499,85 +546,86 @@
While this describes the general idea, the details are open to slight
variations; in the sequel we show how the switches actually
produced by PyPy 1.0 work.  Our first example is purely artificial:
-
+%
\begin{verbatim}
-        ...
-        b = a / 10
-        c = hint(b, promote=True)
-        d = c + 5
-        print d
-        ...
+    ...
+    b = a / 10
+    c = hint(b, promote=True)
+    d = c + 5
+    print d
+    ...
\end{verbatim}

-In this example, a and b are run-time variables and c and
-d are compile-time variables; b is copied into c via a
+In this example, \code{a} and \code{b} are run-time variables and \code{c} and
+\code{d} are compile-time variables; \code{b} is copied into \code{c} via a
promotion.  The division is a run-time operation while the addition is a
compile-time operation.

The compiler derived from an interpreter containing the above code
generates the following machine code (in pseudo-assembler notation),
-assuming that a comes from register r1:
-
+assuming that \code{a} comes from register \code{r1}:
+%
\begin{verbatim}
-     ...
-        r2 = div r1, 10
-     Label1:
-        jump Label2
-        <some reserved space here>
-
-     Label2:
-        call continue_compilation(r2, <state data pointer>)
-        jump Label1
+ ...
+    r2 = div r1, 10
+ Label1:
+    jump Label2
+    <some reserved space here>
+
+ Label2:
+    call continue_compilation(r2, <state data ptr>)
+    jump Label1
\end{verbatim}

-The first time this machine code runs, the continue\_compilation()
-function resumes the compiler.  The two arguments to the function are
-the actual run-time value from the register r2, which the compiler
+The first time this machine code runs, the function called
+\code{continue\_compilation()}
+resumes the compiler.  The two arguments to the function are
+the actual run-time value from the register \code{r2}, which the compiler
will now consider as a compile-time constant, and an immediate pointer
to data that was generated along with the above code snippet and which
contains enough information for the compiler to know where and with
which state it should resume.

-Assuming that the first run-time value taken by r1 is, say, 42, then
-the compiler will see r2 == 4 and update the above machine code as
+Assuming that the first run-time value taken by \code{r1} is, say, 42, then
+the compiler will see \code{r2 == 4} and update the above machine code as
follows:
-
+%
\begin{verbatim}
-     ...
-        r2 = div r1, 10
-     Label1:
-        compare r2, 4            # patched
-        jump-if-equal Label3     # patched
-        jump Label2              # patched
-        <less reserved space left>
-
-     Label2:
-        call continue_compilation(r2, <state data pointer>)
-        jump Label1
-
-     Label3:                     # new code
-        call print(9)            # new code
-        ...
+ ...
+    r2 = div r1, 10
+ Label1:
+    compare r2, 4            # patched
+    jump-if-equal Label3     # patched
+    jump Label2              # patched
+    <less reserved space left>
+
+ Label2:
+    call continue_compilation(r2, <state data ptr>)
+    jump Label1
+
+ Label3:                     # new code
+    call print(9)            # new code
+    ...
\end{verbatim}

Notice how the addition is constant-folded by the compiler.  (Of course,
in real examples, different promoted values typically make the compiler
constant-fold complex code path choices in different ways, and not just
-simple operations.)  Note also how the code following Label1 is an
+simple operations.)  Note also how the code following \code{Label1} is an
updatable switch which plays the role of a polymorphic inline cache.
The "polymorphic" terminology does not apply in our context, though, as
the switch does not necessarily have to be on the type of an object.

-After the update, the original call to continue\_compilation()
+After the update, the original call to \code{continue\_compilation()}
returns and execution loops back to the now-patched switch at
-Label1.  This run and all following runs in which r1 is between
-40 and 49 will thus directly go to Label3.  Obviously, if other
-values show up, continue\_compilation() will be invoked again, so new
-code will be generated and the code at Label1 further patched to
+\code{Label1}.  This run and all following runs in which \code{r1} is between
+40 and 49 will thus directly go to \code{Label3}.  Obviously, if other
+values show up, \code{continue\_compilation()} will be invoked again, so new
+code will be generated and the code at \code{Label1} further patched to
check for more cases.
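The run-time behaviour of the patchable switch can be simulated in Python by keeping a dictionary of residual continuations keyed by the promoted value; \code{continue\_compilation} below is a stand-in for the real callback, not PyPy's API, and the compile-time constant-folding of \code{+ 5} matches the example above:

```python
compiled_cases = {}   # promoted value -> residual function (one per patched case)

def continue_compilation(value):
    # Stand-in for the real callback: the compiler resumes with `value`
    # as a compile-time constant and constant-folds the addition `+ 5`.
    folded = value + 5
    def residual_case():
        return folded        # e.g. "call print(9)" when value == 4
    compiled_cases[value] = residual_case

def run(a):
    r2 = a // 10             # the residual division, always executed
    if r2 not in compiled_cases:
        continue_compilation(r2)     # switch miss: patch in a new case
    return compiled_cases[r2]()      # jump to the matching case
```

After \code{run(42)}, the case for the promoted value 4 is "patched in"; further runs with \code{r1} between 40 and 49 hit it directly, and other values grow the switch by one case each.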

If, over the course of the execution of a program, too many cases are
-seen, the reserved space after Label1 will eventually run out.
+seen, the reserved space after \code{Label1} will eventually run out.
Currently, we simply reserve more space elsewhere and patch the final
jump accordingly.  There could be better strategies, which we did
not implement so far, such as discarding old code and reusing its slots
@@ -587,13 +635,13 @@

\subsubsection*{Implementation notes}

-The *state data pointer* in the example above contains a snapshot of the
+The state data pointer in the example above contains a snapshot of the
state of the compiler when it reached the promotion point.  Its memory
impact is potentially large -- a complete continuation for each generated
switch, which can never be reclaimed because new run-time values may
always show up later during the execution of the program.

-To reduce the problem we compress the state into a so-called *path*.
+To reduce the problem we compress the state into a so-called \emph{path.}
The full state is only stored at a few specific points.\footnote{
More precisely, at merge points that the user needs to mark
as "global".  The control flow join point corresponding to the
@@ -602,15 +650,15 @@
}
The compiler
records a trace of the multiple paths it followed from the last full
-snapshot in a lightweight tree structure.  The *state data pointer* is
+snapshot in a lightweight tree structure.  The state data pointer is
then only a pointer to a node in the tree; the branch from that node to
-the root describes a path that let the compiler quickly *replay* its
+the root describes a path that let the compiler quickly \emph{replay} its
actions (without generating code again) from the latest full snapshot to
rebuild its internal state and get back to the original promotion point.

For example, if the interpreter source code contains promotions inside a
run-time condition:
-
+%
\begin{verbatim}
if condition:
...
@@ -625,7 +673,7 @@
then the tree will contain three nodes: a root node storing the
snapshot, a child with a "True case" marker, and another child with a
"False case" marker.  Each promotion point generates a switch and a call
-to continue\_compilation() pointing to the appropriate child node.
+to \code{continue\_compilation()} pointing to the appropriate child node.
The compiler can re-reach the correct promotion point by following the
markers on the branch from the root to the child.
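The tree and the replay walk can be sketched as follows, with invented names: each node records only its parent and a branch marker, and the path from the root to a node is recovered by walking upward and reversing.

```python
class PathNode:
    # A node in the lightweight tree of compile-time decisions taken
    # since the last full snapshot.
    def __init__(self, parent=None, marker=None):
        self.parent = parent
        self.marker = marker      # e.g. "True case" or "False case"

def replay_path(node):
    # Walk from the node up to the root, then reverse: the marker
    # sequence tells the compiler which branches to replay (without
    # generating code) to rebuild its state at the promotion point.
    markers = []
    while node.parent is not None:
        markers.append(node.marker)
        node = node.parent
    return list(reversed(markers))
```

The memory cost per promotion point is thus one small node rather than a full compiler snapshot.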

@@ -642,7 +690,7 @@
of fresh variables, one per field.  In the compiler, the variable that
would normally contain the pointer to the structure gets instead a
content that is neither a run-time value nor a compile-time constant,
-but a special *virtual structure* -- a compile-time data structure that
+but a special \emph{virtual structure} -- a compile-time data structure that
recursively contains new variables, each of which can again store a
run-time, a compile-time, or a virtual structure value.

@@ -650,54 +698,54 @@
around by the compiler really represent run-time locations -- the name of
a CPU register or a position in the machine stack frame.  This is the
case for both regular variables and the fields of virtual structures.
-It means that the compilation of a getfield or setfield
+It means that the compilation of a \code{getfield} or \code{setfield}
operation performed on a virtual structure simply loads or stores such a
location reference into the virtual structure; the actual value is not
copied around at run-time.

It is not always possible to keep structures virtual.  The main
-situation in which it needs to be "forced" (i.e. actually allocated at
+situation in which it needs to be ``forced'' (i.e.\ actually allocated at
run-time) is when the pointer escapes to some non-virtual location like
a field of a real heap structure.
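The idea of a compiler variable whose content is either a run-time location, a compile-time constant, or a virtual structure can be sketched as follows (an illustrative model with invented class names, not PyPy's implementation):

```python
class RuntimeValue:
    """A run-time location, e.g. a CPU register or a stack slot."""
    def __init__(self, location):
        self.location = location

class CompileTimeConstant:
    """A value fully known to the compiler."""
    def __init__(self, value):
        self.value = value

class VirtualStructure:
    """A compile-time stand-in for a structure that is never allocated
    at run-time: a mapping of field names to contents, each of which
    is again a RuntimeValue, a CompileTimeConstant, or another
    VirtualStructure."""
    def __init__(self):
        self.fields = {}

    def setfield(self, name, content):
        # no run-time store is emitted: only the reference is recorded
        self.fields[name] = content

    def getfield(self, name):
        # no run-time load either: the content is returned directly
        return self.fields[name]

# The intermediate box of a+b+c: its intval field simply references
# the register holding the result of the first run-time addition.
box = VirtualStructure()
box.setfield('intval', RuntimeValue('register r3'))
content = box.getfield('intval')
# content.location == 'register r3'; no copying happened at "run-time"
```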

Virtual structures still avoid the run-time allocation of most
short-lived objects, even in non-trivial situations.  The following
-example shows a typical case.  Consider the Python expression a+b+c.
-Assume that a contains an integer.  The PyPy Python interpreter
+example shows a typical case.  Consider the Python expression \code{a+b+c}.
+Assume that \code{a} contains an integer.  The PyPy Python interpreter
implements application-level integers as boxes -- instances of a
-W\_IntObject class with a single intval field.  Here is the
+\code{W\_IntObject} class with a single \code{intval} field.  Here is
+the corresponding \code{add()} function:
-
+%
\begin{verbatim}
-    def add(w1, w2):            # w1, w2 are W_IntObject instances
-        value1 = w1.intval
-        value2 = w2.intval
-        result = value1 + value2
-        return W_IntObject(result)
+  def add(w1, w2):          # w1, w2 are instances
+      value1 = w1.intval    # of W_IntObject
+      value2 = w2.intval
+      result = value1 + value2
+      return W_IntObject(result)
\end{verbatim}

-When interpreting the bytecode for a+b+c, two calls to add() are
-issued; the intermediate W\_IntObject instance is built by the first
+When interpreting the bytecode for \code{a+b+c}, two calls to \code{add()} are
+issued; the intermediate \code{W\_IntObject} instance is built by the first
call and thrown away after the second call.  By contrast, when the
interpreter is turned into a compiler, the construction of the
-W\_IntObject object leads to a virtual structure whose intval
+\code{W\_IntObject} object leads to a virtual structure whose \code{intval}
field directly references the register in which the run-time addition
put its result.  This location is read out of the virtual structure at
-the beginning of the second add(), and the second run-time addition
+the beginning of the second \code{add()}, and the second run-time addition
directly operates on the same register.

An interesting effect of virtual structures is that they play nicely with
-promotion.  Indeed, before the interpreter can call the proper add()
+promotion.  Indeed, before the interpreter can call the proper \code{add()}
function for integers, it must first determine that the two arguments
are indeed integer objects.  In the corresponding dispatch logic, we
have added two hints to promote the type of each of the two arguments.
This produces a compiler that has the following behavior: in the general
-case, the expression a+b will generate two consecutive run-time
+case, the expression \code{a+b} will generate two consecutive run-time
switches followed by the residual code of the proper version of
-add().  However, in a+b+c, the virtual structure representing
+\code{add()}.  However, in \code{a+b+c}, the virtual structure representing
the intermediate value will contain a compile-time constant as type.
Promoting a compile-time constant is trivial -- no run-time code is
-generated.  The whole expression a+b+c thus only requires three
+generated.  The whole expression \code{a+b+c} thus only requires three
switches instead of four.  It is easy to see that even more switches can
be skipped in larger examples; typically, in a tight loop manipulating
only integers, all objects are virtual structures for the compiler and
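The switch-counting argument above can be made concrete with a small sketch (the `promote` helper and the tuple encoding are invented here for illustration):

```python
emitted = []   # residual operations generated so far

def promote(content):
    """Sketch of promotion: a run-time value costs one generated
    run-time switch; a value the compiler already knows (a
    compile-time constant) promotes for free -- nothing is emitted."""
    kind, value = content
    if kind == 'runtime':
        emitted.append('switch')
    return value

# a+b: the types of both arguments are run-time values -> two switches
promote(('runtime', int))
promote(('runtime', int))
# +c: the intermediate result is a virtual structure whose type field
# is a compile-time constant -> free; only c's type costs a switch
promote(('constant', int))
promote(('runtime', int))
# emitted now records three switches instead of four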
@@ -722,7 +770,7 @@
or dictionary implementing the bindings of the locals.  Then each local
variable of the interpreted language can be represented as a separate
run-time value in the generated code, or be itself further virtualized
-(e.g. as a virtual W\_IntObject structure as seen above).
+(e.g.\ as a virtual \code{W\_IntObject} structure as seen above).

The issue is that the frame object is sometimes built in advance by
non-JIT-generated code; even when it is not, it immediately escapes into
@@ -732,12 +780,12 @@
into a global data structure (even though in practice most of frame
objects are deallocated without ever having been introspected).

-To solve this problem, we introduced *virtualizable structures,* a mix
+To solve this problem, we introduced \emph{virtualizable structures,} a mix
between regular run-time structures and virtual structures.  A virtualizable structure is a
structure that exists at run-time in the heap, but that is
simultaneously treated as virtual by the compiler.  Accesses to the
structure from the code generated by the JIT are virtualized away,
-i.e.  don't involve run-time copying.  The trade-off is that in order
+i.e.\ don't involve run-time copying.  The trade-off is that in order
to keep both views synchronized, accesses to the run-time structure
from regular code not produced by the JIT need to perform an extra
check.
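The extra check paid by regular, non-JIT code can be sketched as follows (a simplified model with invented names; the real mechanism operates on the translated structures, not Python objects):

```python
class VirtualizableFrame:
    """A frame that exists in the heap but whose fields may, while
    JIT-generated code is running, live in registers instead.  Regular
    code must check a flag and trigger synchronization before reading
    the heap copy."""
    def __init__(self):
        self.locals = {}
        self.jit_owns_fields = False   # set while JIT code is running
        self._sync_callback = None     # installed by the JIT

    def get_local(self, name):
        if self.jit_owns_fields:
            # the extra check: force up-to-date values out of the
            # registers and back into the heap copy
            self._sync_callback(self)
            self.jit_owns_fields = False
        return self.locals[name]

def fake_sync(frame):
    # stand-in for the JIT writing register contents back to the heap
    frame.locals['x'] = 42

frame = VirtualizableFrame()
frame.locals['x'] = 0              # stale heap copy
frame.jit_owns_fields = True
frame._sync_callback = fake_sync
value = frame.get_local('x')       # triggers the synchronization check
# value is 42: the reader sees the JIT's up-to-date content
```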
@@ -776,92 +824,52 @@

We quickly mention below a few other features; more implementation details
-can be found in the on-line documentation.
+can be found in the on-line documentation \cite{PyPy}.  % => ref to web site
+%
+\begin{itemize}

-* There are more user-specified hints available, like *deep-freezing,*
+\item There are more user-specified hints available, like \emph{deep-freezing,}
which marks an object as immutable in order to allow accesses to
its content to be constant-folded at compile-time.

-* The compiler representation of a run-time value for a non-virtual
+\item The compiler representation of a run-time value for a non-virtual
structure may additionally remember that some fields are actually
compile-time constants.  This occurs for example when a field is
read from the structure at run-time and then promoted to compile-time.

-* In addition to virtual structures, lists and dictionaries can also be
+\item In addition to virtual structures, lists and dictionaries can also be
virtual.

-* Exception handling is achieved by inserting explicit operations into
+\item Exception handling is achieved by inserting explicit operations into
the graphs before they are timeshifted.  Most of these run-time
exception manipulations are then virtualized away, by treating the
exception state as virtual.

-* Timeshifting is performed in two phases: a first step transforms the
+\item Timeshifting is performed in two phases: a first step transforms the
graphs by updating their control flow and inserting pseudo-operations
to drive the compiler; a second step (based on the RTyper \cite{D05.1})
replaces all necessary operations by calls to support code.

-* The support code implements the generic behaviour of the compiler,
-  e.g. the merge logic.  It is about 3500 lines of RPython code.  The
+\item The support code implements the generic behaviour of the compiler,
+  e.g.\ the merge logic.  It is about 3500 lines of RPython code.  The
rest of the hint-annotator and timeshifter is about 3800 lines of
Python code.

-* The machine code backends (two so far, Intel IA32 and PowerPC) are
+\item The machine code backends (two so far, Intel IA32 and PowerPC) are
about 3500 further lines of RPython code each.  There is a
well-defined interface between the JIT compiler support code and the
backends, making writing new backends relatively easy.  The unusual
part of the interface is the support for the run-time updatable
switches.

-
-\subsection{Open issues}
-
-Here are what we think are the most important points that will need
-attention in order to make the approach more robust:
-
-* The timeshifted graphs currently compile many branches eagerly.  This
-  can easily result in residual code explosion.  Depending on the source
-  interpreter this can also result in non-termination issues, where
-  compilation never completes.  The opposite extreme would be to always
-  compile branches lazily, when they are about to be executed, as Psyco
-  does.  While this neatly sidesteps termination issues, the best
-  solution is probably something in between these extremes.
-
-* As described in the Promotion section (\ref{promotion}),
-  we need fall-back solutions for when the
-  number of promoted run-time values seen at a particular point becomes
-  too large.
-
-* We need more flexible control about what to inline or not to inline in
-  the residual code.
-
-* The widening heuristics for merging needs to be refined.
-
-* The JIT generation framework needs to be made aware of some other
-  translation-time aspects \cite{D05.4} \cite{D07.1} in order to produce the
-  correct residual code (e.g. code calling the correct Garbage
-  Collection routines or supporting Stackless-style stack unwinding).
-
-* We did not work yet on profile-directed identification of program hot
-  spots.  Currently, the interpreter must decide when to invoke the JIT
-  or not (which can itself be based on explicit requests from the interpreted
-  program).
-
-* The machine code backends can be improved.
-
-The latter point opens an interesting future research direction: can we
-layer our kind of JIT compiler on top of a virtual machine that already
-contains a lower-level JIT compiler?  In other words, can we delegate
-the difficult questions of machine code generation to a lower
-independent layer, e.g. inlining, re-optimization of frequently executed
-code, etc.?  What changes would be required to an existing virtual
-machine, e.g. a Java Virtual Machine, to support this?
+\end{itemize}
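The deep-freezing hint from the first item above can be illustrated with a toy sketch (all names here are invented; PyPy's hint works on RPython structures at translation time, not on Python objects):

```python
class Box:
    """A structure whose field reads normally happen at run-time."""
    def __init__(self, value):
        self.value = value
        self.frozen = False

def deepfreeze(obj):
    """Hint: promise the compiler that obj will never be mutated."""
    obj.frozen = True
    return obj

def compile_read(obj):
    """Sketch of the compiler's treatment of a field read: a read from
    a deep-frozen object is constant-folded at compile-time; any other
    read stays a residual run-time operation."""
    if obj.frozen:
        return ('constant', obj.value)       # folded away
    return ('runtime_getfield', 'value')     # residual operation

frozen_cfg = deepfreeze(Box(100))
folded = compile_read(frozen_cfg)       # ('constant', 100)
residual = compile_read(Box(100))       # ('runtime_getfield', 'value')
```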

\section{Results}

The following test function is an example of purely arithmetic code
written in Python, which the PyPy JIT can run extremely fast:
-
+%
\begin{verbatim}
def f1(n):
    "Arbitrary test function."
@@ -876,39 +884,40 @@
    return x
\end{verbatim}

-We measured the time required to compute f1(2117) on the following
+We measured the time required to compute \code{f1(2117)} on the following
interpreters:
+%
+\begin{itemize}

-* Python 2.4.4, the standard CPython implementation.
+\item Python 2.4.4, the standard CPython implementation.

-* A version of pypy-c including a generated JIT compiled.
+\item A version of pypy-c (our Python interpreter translated to a stand-alone
+  executable via C) including a generated JIT compiler.

-* gcc 4.1.1 compiling the above function rewritten in C (which, unlike
+\item gcc 4.1.1 compiling the above function rewritten in C (which, unlike
the other two, does not do any overflow checking on the arithmetic
operations).

+\end{itemize}
+
The relative results have been found to vary by 25\% depending on the
machine.  On our reference benchmark machine, a 4-core Intel(R)
Xeon(TM) CPU 3.20GHz with 5GB of RAM, we obtained the following results
(the numbers in parentheses are the slow-down ratios relative to the
unoptimized gcc compilation):

-+-----------------------------------------+------------------+
-| Interpreter                             | Seconds per call |
-+=========================================+==================+
-| Python 2.4.4                            | 0.82    (132x)   |
-+-----------------------------------------+------------------+
-| Python 2.4.4 with Psyco 1.5.2           | 0.0062  (1.00x)  |
-+-----------------------------------------+------------------+
-| pypy-c with the JIT turned off          | 1.77    (285x)   |
-+-----------------------------------------+------------------+
-| pypy-c with the JIT turned on           | 0.0091  (1.47x)  |
-+-----------------------------------------+------------------+
-| gcc                                     | 0.0062  (1x)     |
-+-----------------------------------------+------------------+
-| gcc -O2                                 | 0.0022  (0.35x)  |
-+-----------------------------------------+------------------+
-
+\begin{tabular}{|l|ll|}
+\hline
+Interpreter & \multicolumn{2}{|c|}{Seconds per call} \\
+\hline
+Python 2.4.4                            & 0.82   & (132x)   \\
+Python 2.4.4 with Psyco 1.5.2           & 0.0062 & (1.00x)  \\
+pypy-c with the JIT turned off          & 1.77   & (285x)   \\
+pypy-c with the JIT turned on           & 0.0091 & (1.47x)  \\
+gcc                                     & 0.0062 & (1x)     \\
+gcc -O2                                 & 0.0022 & (0.35x)  \\
+\hline
+\end{tabular}
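A per-call measurement of this kind can be reproduced with a small harness along the following lines (illustrative only; the `demo` function below is a stand-in, not the paper's `f1`, whose full body is elided above):

```python
import time

def time_per_call(func, arg, repetitions=10):
    """Average wall-clock seconds per call of func(arg)."""
    start = time.time()
    result = None
    for _ in range(repetitions):
        result = func(arg)
    elapsed = time.time() - start
    return result, elapsed / repetitions

# Stand-in arithmetic function, just to exercise the harness:
def demo(n):
    x = 1
    for i in range(n):
        x = x + (i & x)
    return x

result, seconds = time_per_call(demo, 2117)
# 'seconds' is the figure reported per interpreter in the table above;
# dividing two such figures gives the slow-down ratio in parentheses
```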

This table shows that the PyPy JIT is able to generate residual code
that runs within the same order of magnitude as an unoptimizing gcc.  It
@@ -932,6 +941,54 @@
as 1.15x.

+\section{Future work}
+
+The following are what we believe to be the most important points that
+will need attention in order to make the approach more robust:
+%
+\begin{itemize}
+
+\item The timeshifted graphs currently compile many branches eagerly.  This
+  can easily result in residual code explosion.  Depending on the source
+  interpreter this can also result in non-termination issues, where
+  compilation never completes.  The opposite extreme would be to always
+  compile branches lazily, when they are about to be executed, as Psyco
+  does.  While this neatly sidesteps termination issues, the best
+  solution is probably something in between these extremes.
+
+\item As described in the section on promotion (section \ref{promotion}),
+  we need fall-back solutions for when the
+  number of promoted run-time values seen at a particular point becomes
+  too large.
+
+\item We need more flexible control over what to inline or not to inline in
+  the residual code.
+
+\item The widening heuristics for merging need to be refined.
+
+\item The JIT generation framework needs to be made aware of some other
+  translation-time aspects in order to produce the correct residual code
+  (e.g.\ code calling the correct Garbage Collection routines or
+  supporting Stackless-style stack unwinding \cite{D07.1}).
+
+\item We have not yet worked on profile-directed identification of program hot
+  spots.  Currently, the interpreter must decide when to invoke the JIT
+  or not (which can itself be based on explicit requests from the interpreted
+  program).
+
+\item The machine code backends can be improved.
+
+\end{itemize}
+
+The latter point opens an interesting future research direction: can we
+layer our kind of JIT compiler on top of a virtual machine that already
+contains a lower-level JIT compiler?  In other words, can we delegate
+the difficult questions of machine code generation to a lower
+independent layer, e.g.\ inlining, re-optimization of frequently executed
+code, etc.?  What changes would be required to an existing virtual
+machine, e.g.\ a Java Virtual Machine, to support this?
+
+
\section{Conclusion}

Producing the results described in the previous section requires the
@@ -943,8 +1000,8 @@
boxing and to propagate them in the CPU stack and registers.

Some slight reorganisation of the interpreter main loop, without
-influence, marking the frames as virtualizable (\ref{virtualizable}),
+changing its semantics, marking the frames as virtualizable
+(section \ref{virtualizable}), and adding hints at
a few crucial points was all that was necessary for our Python
interpreter.

@@ -957,9 +1014,24 @@
compiler would be robust against language changes up to the need to
maintain and possibly change the hints.

-We consider this as a major breakthrough in term of the possibilities
-it opens for language design and implementation; it was one of the
-main goals of the research program within the PyPy project.
+We consider this as a major breakthrough in terms of the possibilities it
+opens for language design and implementation; it was one of the main
+goals of the research program within the PyPy project.  Only groups with
+very large amounts of resources can afford the high costs of writing
+just-in-time compilers from scratch.  Communities with limited available
+resources for the implementation and maintenance of a language, such as
+academic and open source projects, cannot afford such costs
+-- and even when experimental just-in-time compilers exist, the mere
+fact of having to maintain them in parallel with other implementations
+is taxing for such communities, particularly when the languages in
+question evolve quickly.  In the PyPy approach, from a single simple
+implementation of the language, we can generate stand-alone virtual
+machines whose performance far exceeds that of traditional hand-written
+virtual machines (like CPython, the reference C implementation of
+Python); with the generation of a dynamic compiler, we achieve
+state-of-the-art performance.
+
+% XXX balance columns

%.. References (title not necessary, latex generates it)
@@ -975,7 +1047,7 @@
%.. [D08.1] Release a JIT Compiler for PyPy Including Processor Backends
%           for Intel and PowerPC, PyPy EU-Report, 2007
%
-%.. [FU]    Partial evaluation of compuation process -- an approach to a
+%.. [FU]    Partial evaluation of computation process -- an approach to a
%           compiler-compiler, Yoshihito Futamura, Higher-Order and
%           Systems Computers Controls 2(5), 1971
@@ -1003,8 +1075,7 @@
%           conference on Object-oriented programming languages, systems, and
%           applications, pp. 944-953, ACM Press, 2006

-\bigskip
-
+% ---- Bibliography ----
\bibliographystyle{abbrv}
\bibliography{paper}

`