cfbolz at codespeak.net
Mon Sep 27 14:32:19 CEST 2010

Author: cfbolz
Date: Mon Sep 27 14:32:18 2010
New Revision: 77409

Modified:
   pypy/extradoc/talk/pepm2011/paper.tex
Log:
refactor a number of things, add many XXXs

==============================================================================
+++ pypy/extradoc/talk/pepm2011/paper.tex	Mon Sep 27 14:32:18 2010
@@ -91,8 +91,12 @@

XXX drop the word "allocation removal" somewhere

+XXX define "escape analysis"
+
\section{Introduction}

+XXX need to re-target introduction a bit to fit PEPMs focus
+
The goal of a just-in-time compiler for a dynamic language is obviously to
improve the speed of the language over an implementation of the language that
uses interpretation. The first goal of a JIT is thus to remove the
@@ -122,8 +126,9 @@

A recently popular approach to implementing just-in-time compilers for dynamic
languages is that of a tracing JIT. A tracing JIT often takes the form of an
-extension to an existing interpreter, which can be sped up that way. The PyPy
-project is an environment for implementing dynamic programming languages. It's
+extension to an existing interpreter, which can be sped up that way. This
+approach is also the one taken by the PyPy project, which is an environment for
+implementing dynamic programming languages. PyPy's
approach to doing so is to straightforwardly implement an interpreter for the
to-be-implemented language, and then use powerful tools to turn the interpreter
into an efficient VM that also contains a just-in-time compiler. This compiler
@@ -152,7 +157,7 @@
The contributions of this paper are:

\begin{enumerate}
-    \item An efficient and effective algorithm for removing objects allocations in a tracing JIT.
+    \item An efficient and effective algorithm for removing object allocations in a tracing JIT.
\item XXX
\end{enumerate}

@@ -232,6 +237,7 @@
on. These guards are the only mechanism to stop the execution of a trace; the
loop end condition also takes the form of a guard.

+bridges?

arguments to traces

@@ -316,10 +322,15 @@
return res
\end{verbatim}

-The loop iterates \texttt{y} times, and computes something in the process. To
-understand the reason why executing this function is slow, here is the trace
-that is produced by the tracing JIT when executing the function with \texttt{y}
-being a \texttt{BoxedInteger}: XXX make it clear that this is really a trace specific for BoxedInteger
+The loop iterates \texttt{y} times, and computes something in the process.
+Simply running this function is slow, because there are lots of virtual method
+calls inside the loop, one for each \texttt{is\_positive} and even two for each
+call to \texttt{add}. These method calls need to check the type of the involved
+objects repeatedly and redundantly. In addition, a lot of objects are created
+when executing that loop; many of these objects do not survive for very long.
+The actual computation that is performed by \texttt{f} is simply a sequence of
+integer additions and comparisons.
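The class definitions behind this example do not appear in this hunk; the following is a minimal Python sketch of what \texttt{BoxedInteger} and \texttt{f} could look like, based on the method names mentioned in the text (\texttt{is\_positive}, \texttt{add}). The exact arithmetic in the loop body is guessed and hypothetical:

```python
class BoxedInteger(object):
    def __init__(self, intval):
        self.intval = intval

    def is_positive(self):
        # one virtual method call per loop iteration
        return self.intval > 0

    def add(self, other):
        # double dispatch: one call on the receiver,
        # a second one on the argument
        return other.add__int(self.intval)

    def add__int(self, intval):
        return BoxedInteger(intval + self.intval)


def f(y):
    # hypothetical loop body; the paper's real example may differ
    res = BoxedInteger(0)
    while y.is_positive():
        res = res.add(y)
        y = y.add(BoxedInteger(-1))
    return res
```

Every iteration allocates fresh \texttt{BoxedInteger} instances in \texttt{add} that are discarded almost immediately, which is exactly the behavior the trace below exposes.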

\begin{figure}
\begin{verbatim}
@@ -378,17 +389,34 @@
guard_true(i17)
jump(p15, p10)
\end{verbatim}
-\caption{unoptimized trace for the simple object model}
+\caption{Unoptimized Trace for the Simple Object Model}
+\label{fig:unopt-trace}
\end{figure}

-(indentation corresponds to the stack level of the traced functions).
+If the function is executed using the tracing JIT, with \texttt{y} being a
+\texttt{BoxedInteger}, the produced trace looks like
+Figure~\ref{fig:unopt-trace}. The operations in the trace are indented to
+correspond to the stack level of the function that contains the traced
+operation. The trace also shows the inefficiencies of \texttt{f} clearly, if one
+looks at the number of \texttt{new}, \texttt{set/getfield\_gc} and
+\texttt{guard\_class} operations.
+
+Note how the functions that are called by \texttt{f} are automatically inlined
+into the trace. The method calls are always preceded by a \texttt{guard\_class}
+operation, to check that the class of the receiver is the same as the one that
+was observed during tracing.\footnote{\texttt{guard\_class} performs a precise
+class check, not checking for subclasses.} These guards make the trace specific
+to the situation where \texttt{y} is really a \texttt{BoxedInteger}; it can
+already be said to be specialized for \texttt{BoxedIntegers}. When the trace is
+turned into machine code and then executed with \texttt{BoxedFloats}, the
+first \texttt{guard\_class} instruction will fail and execution will continue
+using the interpreter.
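The precise class check that the footnote mentions can be illustrated in plain Python (a sketch for illustration only, not the JIT's implementation; \texttt{TaggedInteger} is a hypothetical subclass):

```python
class BoxedInteger(object):
    pass

class TaggedInteger(BoxedInteger):
    # hypothetical subclass, only for illustration
    pass

def guard_class(obj, expected_cls):
    # precise check: an instance of a subclass fails the guard,
    # unlike an isinstance() check
    return type(obj) is expected_cls
```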

-The trace is inefficient for a couple of reasons. One problem is that it checks
-repeatedly and redundantly for the class of the objects around, using a
-\texttt{guard\_class} instruction. In addition, some new \texttt{BoxedInteger} instances are
-constructed using the \texttt{new} operation, only to be used once and then forgotten
-a bit later. In the next section, we will see how this can be improved upon,
-using escape analysis.
+XXX simplify traces a bit more
+get rid of \_gc suffix in set/getfield\_gc
+
+In the next section, we will see how this can be improved upon, using escape
+analysis. XXX

\section{Object Lifetimes in a Tracing JIT}
@@ -400,38 +428,53 @@
tracing JIT compiler.

\begin{figure}

\end{figure}

-The figure shows a trace before optimization, together with the lifetime of
-various kinds of objects created in the trace. It is executed from top to
-bottom. At the bottom, a jump is used to execute the same loop another time.
-For clarity, the figure shows two iterations of the loop.
-The loop is executed until one of the guards in the trace fails, and the
-execution is aborted.
-
-Some of the operations within this trace are \texttt{new} operations, which each create a
-new instance of some class. These instances are used for a while, e.g. by
-calling methods on them, reading and writing their fields. Some of these
-instances escape, which means that they are stored in some globally accessible
-place or are passed into a function.
+Figure~\ref{fig:lifetimes} shows a trace before optimization, together with the
+lifetime of various kinds of objects created in the trace. It is executed from
+top to bottom. At the bottom, a jump is used to execute the same loop another
+time (for clarity, the figure shows two iterations of the loop). The loop is
+executed until one of the guards in the trace fails; at that point execution is
+aborted and interpretation resumes.
+
+Some of the operations within this trace are \texttt{new} operations, which each
+create a new instance of some class. These instances are used for a while, e.g.
+by calling methods on them (which are inlined into the trace), reading and
+writing their fields. Some of these instances \emph{escape}, which means that
+they are stored in some globally accessible place or are passed into a function.
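The two escape routes just named can be illustrated with a small hypothetical Python example (all names invented for illustration):

```python
cache = []               # a globally accessible place, hypothetical

def log(obj):
    # a function outside the trace; passing an object here makes it escape
    return obj.intval

class BoxedInteger(object):
    def __init__(self, intval):
        self.intval = intval

def g():
    a = BoxedInteger(1)  # never escapes: only used locally
    b = BoxedInteger(2)
    cache.append(b)      # b escapes into a global structure
    c = BoxedInteger(3)
    log(c)               # c escapes by being passed into a function
    return a.intval
```

Only the allocation of \texttt{a} could be removed entirely; \texttt{b} and \texttt{c} must exist as real objects at their escape points.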

Together with the \texttt{new} operations, the figure shows the lifetimes of the
-created objects. Objects in category 1 live for a while, and are then just not
-used any more. The creation of these objects is removed by the
-optimization described in the last section.
+created objects. The objects that are created within a trace using \texttt{new}
+fall into one of several categories:

-Objects in category 2 live for a while and then escape. The optimization of the
-last section deals with them too: the \texttt{new} that creates them and
-the field accesses are deferred, until the point where the object escapes.
+\begin{itemize}
+    \item Category 1: Objects that live for a while, and are then just not
+    used any more.
+
+    \item Category 2: Objects that live for a while and then escape.
+
+    \item Category 3: Objects that live for a while, survive across the jump to
+    the beginning of the loop, and are then not used any more.
+
+    \item Category 4: Objects that live for a while, survive across the jump,
+    and then escape. We also count among these the objects that live across
+    several jumps and then either escape or stop being
+    used.\footnote{In theory, the approach of Section~\ref{sec:XXX} also works
+    for objects that live for exactly $n>1$ iterations and then don't escape,
+    but we expect this to be a very rare case, so we do not handle it.}
+\end{itemize}
+
+The objects that are allocated in the example trace in
+Figure~\ref{fig:unopt-trace} fall into categories 1 and 3. Objects stored in
+\texttt{p5, p6, p11 XXX} are in category 1, objects in \texttt{p10, p15} are in
+category 3.

-The objects in category 3 and 4 are in principle like the objects in category 1
-and 2. They are created, live for a while, but are then passed as an argument
-to the \texttt{jump} operation. In the next iteration they can either die (category
-3) or escape (category 4).
+The creation of objects in category 1 is removed by the optimization described
+in Section~\ref{sec:virtuals}. XXX

\section{Escape Analysis in a Tracing JIT}
\label{sec:virtuals}
@@ -439,26 +482,29 @@

\subsection{Virtual Objects}

-The main insight to improve the code shown in the last section is that some of
-the objects created in the trace using a \texttt{new} operation don't survive very
-long and are collected by the garbage collector soon after their allocation.
-Moreover, they are used only inside the loop, thus we can easily prove that
-nobody else in the program stores a reference to them. The
-idea for improving the code is thus to analyze which objects never escape the
-loop and may thus not be allocated at all.
+The main insight to improve the code shown in the last section is that objects
+in category 1 don't survive very long and are collected by the garbage collector
+soon after their allocation. Moreover, they are used only inside the loop and
+nobody else in the program stores a reference to them. The idea for improving
+the code is thus to analyze which objects fall into category 1, since those
+need not be allocated at all.
+
+XXX is "symbolic execution" the right word to drop?

This process is called \emph{escape analysis}. The escape analysis of
our tracing JIT works by using \emph{virtual objects}: The trace is walked from
beginning to end and whenever a \texttt{new} operation is seen, the operation is
removed and a virtual object is constructed. The virtual object summarizes the
shape of the object that is allocated at this position in the original trace,
-and is used by the escape analysis to improve the trace. The shape describes
+and is used by the optimization to improve the trace. Each shape describes
where the values that would be stored in the fields of the allocated objects
come from. Whenever the optimizer sees a \texttt{setfield} that writes into a virtual
object, that shape summary is thus updated and the operation can be removed.
When the optimizer encounters a \texttt{getfield} from a virtual, the result is read
from the virtual object, and the operation is also removed.
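The walk just described can be sketched in Python. The encoding of trace operations as \texttt{(result, opname, args)} tuples is hypothetical and chosen for readability; the real optimizer's data structures differ:

```python
class VirtualObject(object):
    def __init__(self, cls):
        self.cls = cls       # class observed at the removed `new`
        self.fields = {}     # shape: field name -> source of the value

def optimize(trace):
    # trace: list of (result, opname, args) tuples (hypothetical encoding)
    virtuals = {}   # result var of a removed `new` -> VirtualObject
    env = {}        # result var of a removed `getfield_gc` -> replacement var

    def resolve(v):
        return env.get(v, v)

    optimized = []
    for result, op, args in trace:
        args = [resolve(a) for a in args]
        if op == 'new':
            # remove the allocation, start tracking the shape
            virtuals[result] = VirtualObject(args[0])
        elif op == 'setfield_gc' and args[0] in virtuals:
            obj, value, field = args
            virtuals[obj].fields[field] = value   # update shape, drop op
        elif op == 'getfield_gc' and args[0] in virtuals:
            obj, field = args
            env[result] = virtuals[obj].fields[field]  # forward value, drop op
        else:
            optimized.append((result, op, args))
    return optimized
```

On a three-operation allocation pattern, only the operations that do not touch a virtual object survive the walk.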

+XXX what happens on a guard\_class?
+
In the example from last section, the following operations would produce two
virtual objects, and be completely removed from the optimized trace:

@@ -511,48 +557,41 @@
values that the virtual object has. This means that instead of the jump, the
following operations are emitted:

-\begin{verbatim}
-p15 = new(BoxedInteger)
-setfield_gc(p15, i14, intval)
-p10 = new(BoxedInteger)
-setfield_gc(p10, i9, intval)
-jump(p15, p10)
-\end{verbatim}
+\texttt{
+\begin{tabular}{l}
+$p_{15}$ = new(BoxedInteger) \\
+setfield\_gc($p_{15}$, $i_{14}$, intval) \\
+$p_{10}$ = new(BoxedInteger) \\
+setfield\_gc($p_{10}$, $i_{9}$, intval) \\
+jump($p_{15}$, $p_{10}$) \\
+\end{tabular}
+}

-Note how the operations for creating these two instances has been moved down the
+Note how the operations for creating these two instances have been moved down the
trace. It may seem that we did not gain much for these operations, because
the objects are still allocated at the end. However, the optimization was still
worthwhile even in this case, because some operations that have been performed
on the forced virtual objects have been removed (some \texttt{getfield\_gc} operations
and \texttt{guard\_class} operations).

-The final optimized trace of the example looks like this:
-
-\begin{verbatim}
-# arguments to the trace: p0, p1
-guard_class(p1, BoxedInteger)
-i2 = getfield_gc(p1, intval)
-guard_class(p0, BoxedInteger)
-i3 = getfield_gc(p0, intval)
+\begin{figure}
+\includegraphics{figures/step1.pdf}
+\caption{Resulting Trace After Allocation Removal}
+\label{fig:step1}
+\end{figure}

-guard_class(p0, BoxedInteger)
-i12 = getfield_gc(p0, intval)
-
-i17 = int_gt(i14, 0)
-guard_true(i17)
-p15 = new(BoxedInteger)
-setfield_gc(p15, i14, intval)
-p10 = new(BoxedInteger)
-setfield_gc(p10, i9, intval)
-jump(p15, p10)
-\end{verbatim}
+The final optimized trace of the example can be seen in
+Figure~\ref{fig:step1}.

The optimized trace contains only two allocations, instead of the original five,
and only three \texttt{guard\_class} operations, down from the original seven.

+\subsection{Algorithm}
+\label{sub:Algorithm}
+
+XXX want some sort of pseudo-code
+
+% subsection Algorithm (end)

%___________________________________________________________________________

@@ -564,6 +603,10 @@
of the type dispatching overhead. In the next section, we will explain how this
optimization can be improved further.

+XXX Category 2 The optimization of
+Section~\ref{sec:virtuals} deals with them too: the \texttt{new} that creates them and
+the field accesses are deferred, until the point where the object escapes.
+
% section Escape Analysis in a Tracing JIT (end)

\section{Escape Analysis Across Loop Boundaries}
@@ -598,10 +641,6 @@
The final trace was much better than the original one, because many allocations
were removed from it. However, it also still contained allocations:

-\begin{figure}
-\includegraphics{figures/step1.pdf}
-\end{figure}
-
The two new \texttt{BoxedIntegers} stored in \texttt{p15} and \texttt{p10} are passed into
the next iteration of the loop. The next iteration will check that they are
indeed \texttt{BoxedIntegers}, read their \texttt{intval} fields and then not use them