cfbolz at codespeak.net
Fri Oct 15 16:47:12 CEST 2010

Author: cfbolz
Date: Fri Oct 15 16:47:10 2010
New Revision: 77993

Modified:
   pypy/extradoc/talk/pepm2011/paper.tex
Log:
incorporate first round of comments by stephan

==============================================================================
--- pypy/extradoc/talk/pepm2011/paper.tex	(original)
+++ pypy/extradoc/talk/pepm2011/paper.tex	Fri Oct 15 16:47:10 2010
@@ -106,7 +106,7 @@
The performance of many dynamic language implementations suffers from
high allocation rates and runtime type checks.  This makes dynamic
languages less applicable to purely algorithmic problems, despite their
-growing popularity.  In this paper, we present a simple optimization
+growing popularity.  In this paper we present a simple compiler optimization
based on online partial evaluation to remove object allocations and
runtime type checks in the context of a tracing JIT.  We evaluate the
optimization using a Python VM and find that it gives good results for
@@ -130,7 +130,7 @@

\section{Introduction}

-The goal of a just-in-time (JIT) compiler for a dynamic language is obviously to
+The objective of a just-in-time (JIT) compiler for a dynamic language is to
improve the speed of the language over an implementation of the language that
uses interpretation. The first goal of a JIT is therefore to remove the
interpretation overhead, i.e. the overhead of bytecode (or AST) dispatch and the
@@ -142,38 +142,37 @@

Boxing of primitive types is necessary because dynamic languages need to be able to handle
all objects, even integers, floats, booleans etc. in the same way as user-defined
-instances. Thus those primitive types are usually \emph{boxed}, i.e. a small
-heap-structure is allocated for them, that contains the actual value. Boxing
+instances. Thus those primitive types are usually \emph{boxed}, \ie a small
+heap-structure is allocated for them that contains the actual value. Boxing
primitive types can be very costly, because a lot of common operations,
-particularly all arithmetic operations, have to produce a new box, in addition
+particularly all arithmetic operations, have to produce new boxes, in addition
to the actual computation they do. Because the boxes are allocated on the heap,
-producing a lot of them puts pressure on the garbage collector.
+producing many of them puts pressure on the garbage collector.
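The allocation cost described here can be sketched in plain Python (a toy model for illustration, not PyPy's actual implementation):

```python
class BoxedInteger:
    """Toy box: wraps a primitive integer in a heap-allocated object."""
    def __init__(self, intval):
        self.intval = intval

    def add(self, other):
        # every arithmetic operation allocates a fresh box,
        # in addition to performing the actual addition
        return BoxedInteger(self.intval + other.intval)

# summing five boxed integers allocates a new intermediate
# box on every iteration, all of which become garbage
acc = BoxedInteger(0)
for i in range(5):
    acc = acc.add(BoxedInteger(i))
print(acc.intval)  # 10
```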

Type dispatching is the process of finding the concrete implementation that is
-applicable to the objects at hand when doing a generic operation on them. An
-example would be the addition of two objects: The addition needs to check what
-the concrete objects that should be added are, and choose the implementation
-that is fitting for them. Type dispatching is a very common operation in
+applicable to the objects at hand when performing a generic operation on them. An
+example would be the addition of two objects: For addition the types of the
+concrete objects need to be checked and the suitable implementation chosen.
+Type dispatching is a very common operation in
modern\footnote{For languages in the LISP family, basic arithmetic operations
are typically not overloaded; even in Smalltalk, type dispatching is much
simpler than in Python or JavaScript.}
-dynamic languages because no types are known at compile time, so all operations
-need it.
+dynamic languages because no types are known at compile time. Therefore all
+operations need it.

A recently popular approach to implementing just-in-time compilers for dynamic
languages is that of a tracing JIT. A tracing JIT works by observing the running
-program and recording its hot spots into linear execution traces. Working on
-traces is the central idea of a tracing JIT. Those traces are optimized and
-turned into machine code.
+program and recording its hot spots into \emph{linear execution traces}. Those
+traces are optimized and turned into machine code.

One reason for the popularity of tracing JITs is their relative
-simplicity. They can often be added to an interpreter and a lot of the
-infrastructure of the interpreter can be reused. They give some important
+simplicity. They can often be added to an existing interpreter, reusing a lot of
+the interpreter's infrastructure. They give some important
-produces linear pieces of code, which simplifies many algorithms that are usually
-hard in a compiler, such as register allocation.
+produces linear pieces of code, which simplifies many of the hard algorithms in
+a compiler, such as register allocation.

-The usage of a tracing JIT can remove the overhead of bytecode dispatch and that
+The use of a tracing JIT can remove the overhead of bytecode dispatch and that
of the interpreter data structures. In this paper we want to present a new
optimization that can be added to a tracing JIT that further removes some of the
@@ -190,14 +189,15 @@
informally described in Section~\ref{sec:statics}; a more formal description is
given in Section~\ref{sec:formal}. The introduced
techniques are evaluated in Section~\ref{sec:Evaluation} using PyPy's Python
-interpreter as a case study.
+interpreter.

-The contributions of this paper are:
+The contributions made by this paper are:

\begin{enumerate}
-    \item An efficient and effective algorithm for removing object allocations in a tracing JIT.
+    \item A description of an efficient and effective algorithm for removing
+          object allocations in a tracing JIT.
\item A characterization of this algorithm as partial evaluation.
-    \item A rigorous evaluation of this algorithm.
+    \item Performance benchmarks for this algorithm.
\end{enumerate}

@@ -215,7 +215,7 @@
\emph{RPython} \cite{davide_ancona_rpython:_2007}. RPython ("restricted Python")
is a subset of Python chosen in such a way that type inference becomes
possible. The language interpreter can then be compiled (``translated'') with
-PyPy's tools into a VM on the C level. During translation to C, many low-level
+PyPy's tools into a VM at the C level. During translation to C, many low-level
aspects of the final VM, such as object layout, garbage collection and memory
model, are woven into the generated code. Therefore the interpreter itself can
remain at a relatively high level of abstraction.
@@ -234,13 +234,13 @@
language that the interpreter is implementing. This process is mostly
automatic; it only needs to be guided by the language implementer using a small number of
source-code hints. Mostly-automatically generating a JIT compiler has many advantages
-over writing one manually, which is an error-prone and tedious process.
+over writing one manually, an error-prone and tedious process.
By construction, the generated JIT has the same semantics as the interpreter.
-Many optimizations can benefit all languages implemented as an interpreter in RPython.
+Optimizations can be shared between different languages implemented with PyPy.

Moreover, thanks to the internal design of the JIT generator, it is very easy
to add new \emph{backends} for producing the actual machine code.  Examples of
-JIT backends that are implemented are the one for Intel x86 and x86-64 and an
+JIT backends that are implemented are those for Intel x86 and x86-64 and an
experimental one for the CLI .NET Virtual Machine \cite{cuni_high_2010}.

\subsection{Tracing JIT Compilers}
@@ -256,7 +256,7 @@
and now Python (and other languages) via PyPy.

The core idea of tracing JITs is to focus the optimization effort of the JIT
-compiler on the hot paths of the core loops of the program and to just use an
+compiler on the commonly executed, \ie \emph{hot} paths of the core loops of the program and to just use an
interpreter for the less commonly executed parts. VMs that use a tracing JIT are
mostly mixed-mode execution environments; they contain both an interpreter and a
JIT compiler. By default the interpreter is used to execute the program, doing
@@ -269,24 +269,23 @@
it always ends with a jump to its own beginning. The trace also contains all
operations that are performed in functions that were called in the loop, thus a
tracing JIT automatically performs inlining.
-
-This trace of operations is then the basis of the generated code. The trace is
+This trace of operations subsequently forms the basis of the generated code. The trace is
first optimized, and then turned into machine code. Both optimization
and machine code generation are simple, because the traces are linear. This
linearity makes many optimizations a lot more tractable, and the inlining that
happens gives the optimizations automatically more context to work with.

Since the trace corresponds to one concrete execution of a loop,
-the code generated from it is only one possible path through it.
-To make sure that the trace is maintaining the correct semantics, it contains a
+the code generated from it is only one possible path through the loop.
+To make sure that the trace maintains the correct semantics, it contains a
\emph{guard} at all places where the execution could have diverged from the
path. Those guards check the assumptions under which execution can stay on the
-trace. As an example, if a loop contains an \lstinline{if} statement, the trace
+trace. As an example, if a loop contains an if-statement, the trace
will contain the execution of one of the paths only, which is the path that was
taken during the production of the trace. The trace will also contain a guard
-that checks that the condition of the \lstinline{if} statement is the same as
+that checks that the condition of the if-statement is the same as
during tracing, because if
-it isn't, the rest of the trace is not valid. \cfbolz{The "if" shouldn't be bold}
+it isn't, the rest of the trace would not be valid.
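To make this concrete, here is a toy sketch (not PyPy's actual trace format) of how a loop body containing an if-statement might be traced when the condition happened to be true, with the untaken branch replaced by a guard:

```python
class GuardFailure(Exception):
    """Raised when a recorded assumption no longer holds;
    a real VM would fall back to the interpreter here."""

def original(x):
    # the source program: both branches exist
    if x % 2 == 0:
        return x // 2
    else:
        return 3 * x + 1

def traced_even_path(x):
    # tracing happened with an even x, so only that path was
    # recorded; the guard re-checks the recorded condition
    if not (x % 2 == 0):
        raise GuardFailure
    return x // 2

print(traced_even_path(10))  # 5, same as original(10)
```

An odd argument makes the guard fail, because the rest of the recorded path would compute the wrong result for it.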

When generating machine code, every guard is turned into a quick check to
see whether the assumption still holds. When such a guard is hit during the
@@ -367,11 +366,11 @@
\label{fig:objmodel}
\end{figure}

-Using these classes to implement arithmetic shows the basic problem that a
-dynamic language implementation has. All the numbers are instances of either
+Using these classes to implement arithmetic shows the basic problem of a
+dynamic language implementation. All the numbers are instances of either
\lstinline{BoxedInteger} or \lstinline{BoxedFloat}, therefore they consume space on the
heap. Performing many arithmetic operations produces lots of garbage quickly,
-which puts pressure on the garbage collector. Using double dispatching to
+putting pressure on the garbage collector. Using double dispatching to
implement the numeric tower needs two method calls per arithmetic operation,
which is costly due to the method dispatch.
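The two method calls per operation can be sketched as follows (a simplified model of the double-dispatch pattern; the helper name `add__int` is illustrative, the paper's actual classes are shown in the object-model figure):

```python
class BoxedInteger:
    """Toy integer box using double dispatching for add."""
    def __init__(self, intval):
        self.intval = intval

    def add(self, other):
        # first dispatch: on the type of self, which reveals
        # that the left operand is an integer
        return other.add__int(self.intval)

    def add__int(self, intval):
        # second dispatch: on the type of other; both operands
        # are now known to be integers, and a new box is made
        return BoxedInteger(intval + self.intval)

print(BoxedInteger(2).add(BoxedInteger(3)).intval)  # 5
```

A `BoxedFloat` would provide its own `add__int` (and `add__float`) methods, so each arithmetic operation costs two dynamic dispatches plus an allocation.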

@@ -384,7 +383,7 @@
calls inside the loop, one for each \lstinline{is_positive} and even two for each
call to \lstinline{add}. These method calls need to check the type of the involved
objects repeatedly and redundantly. In addition, a lot of objects are created
-when executing that loop, many of these objects do not survive for very long.
+when executing that loop; many of these objects are short-lived.
The actual computation that is performed by \lstinline{f} is simply a sequence of

@@ -589,7 +588,7 @@
the type check the guard does is statically known.

In the example from last section, the following operations in the upper half
-of Fig.~\ref{fig:unopt-trace} produce two
+of Figure~\ref{fig:unopt-trace} produce two
static objects, and can be completely removed from the optimized trace:

\begin{lstlisting}[mathescape,xleftmargin=20pt]
@@ -605,7 +604,7 @@
one associated with $p_{6}$ would know that it is a \lstinline{BoxedInteger}
whose \lstinline{intval} field contains the constant -100.

-The subsequent operations in Fig.~\ref{fig:unopt-trace},
+The subsequent operations in Figure~\ref{fig:unopt-trace},
which use $p_{5}$ and $p_{6}$, could then be
optimized using that knowledge:

@@ -628,7 +627,7 @@
$i_{9}$ = int_add($i_{4}$, -100)
\end{lstlisting}

-The rest of the trace from Fig.~\ref{fig:unopt-trace} is optimized similarly.
+The rest of the trace from Figure~\ref{fig:unopt-trace} is optimized similarly.

So far we have only described what happens when static objects are used in guards and in
operations that read and write fields. When the static
@@ -640,7 +639,7 @@
necessary to put operations into the residual code that allocate the
static object at runtime.

-This is what happens at the end of the trace in Fig.~\ref{fig:unopt-trace}, when the \lstinline{jump} operation
+This is what happens at the end of the trace in Figure~\ref{fig:unopt-trace}, when the \lstinline{jump} operation
is hit. The arguments of the jump are at this point static objects. Before the
jump is emitted, they are \emph{lifted}. This means that the optimizer produces code
that allocates a new object of the right type and sets its fields to the field
@@ -897,7 +896,7 @@
\end{lstlisting}

In this case, the static heap afterwards would be
-$\{v^* \mapsto (T_1, w^*, v^*)\}$.
+$$\{v^* \mapsto (T_1, w^*, v^*)\}.$$