[pypy-svn] r77993 - pypy/extradoc/talk/pepm2011

cfbolz at codespeak.net
Fri Oct 15 16:47:12 CEST 2010


Author: cfbolz
Date: Fri Oct 15 16:47:10 2010
New Revision: 77993

Modified:
   pypy/extradoc/talk/pepm2011/paper.tex
Log:
incorporate first round of comments by stephan


Modified: pypy/extradoc/talk/pepm2011/paper.tex
==============================================================================
--- pypy/extradoc/talk/pepm2011/paper.tex	(original)
+++ pypy/extradoc/talk/pepm2011/paper.tex	Fri Oct 15 16:47:10 2010
@@ -106,7 +106,7 @@
 The performance of many dynamic language implementations suffers from
 high allocation rates and runtime type checks.  This makes dynamic
 languages less applicable to purely algorithmic problems, despite their
-growing popularity.  In this paper, we present a simple optimization
+growing popularity.  In this paper we present a simple compiler optimization
 based on online partial evaluation to remove object allocations and
 runtime type checks in the context of a tracing JIT.  We evaluate the
 optimization using a Python VM and find that it gives good results for
@@ -130,7 +130,7 @@
 
 \section{Introduction}
 
-The goal of a just-in-time (JIT) compiler for a dynamic language is obviously to
+The objective of a just-in-time (JIT) compiler for a dynamic language is to
 improve the speed of the language over an implementation of the language that
 uses interpretation. The first goal of a JIT is therefore to remove the
 interpretation overhead, i.e. the overhead of bytecode (or AST) dispatch and the
@@ -142,38 +142,37 @@
 
 Boxing of primitive types is necessary because dynamic languages need to be able to handle
 all objects, even integers, floats, booleans etc. in the same way as user-defined
-instances. Thus those primitive types are usually \emph{boxed}, i.e. a small
-heap-structure is allocated for them, that contains the actual value. Boxing
+instances. Thus those primitive types are usually \emph{boxed}, \ie a small
+heap structure is allocated for them that contains the actual value. Boxing
 primitive types can be very costly, because a lot of common operations,
-particularly all arithmetic operations, have to produce a new box, in addition
+particularly all arithmetic operations, have to produce new boxes, in addition
 to the actual computation they do. Because the boxes are allocated on the heap,
-producing a lot of them puts pressure on the garbage collector.
+producing many of them puts pressure on the garbage collector.
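+
+For instance, a boxed integer is a small heap object that wraps the actual
+machine word (a minimal sketch; the object model used in the rest of this
+paper is shown in Figure~\ref{fig:objmodel}):
+
+\begin{lstlisting}
+class BoxedInteger(object):
+    def __init__(self, intval):
+        # every box is a fresh heap allocation
+        self.intval = intval
+\end{lstlisting}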
 
 Type dispatching is the process of finding the concrete implementation that is
-applicable to the objects at hand when doing a generic operation on them. An
-example would be the addition of two objects: The addition needs to check what
-the concrete objects that should be added are, and choose the implementation
-that is fitting for them. Type dispatching is a very common operation in
+applicable to the objects at hand when performing a generic operation on them. An
+example would be the addition of two objects: the types of the concrete
+objects need to be checked and a suitable implementation chosen.
+Type dispatching is a very common operation in
 modern\footnote{For languages in the LISP family, basic arithmetic operations
 are typically not overloaded; even in Smalltalk, type dispatching is much
 simpler than in Python or JavaScript.}
-dynamic languages because no types are known at compile time, so all operations
-need it.
+dynamic languages because no types are known at compile time. Therefore all
+operations need it.
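+
+A double-dispatching implementation of a generic addition might look as
+follows (a sketch with illustrative method and class names; the object model
+actually used in this paper is shown in Figure~\ref{fig:objmodel}):
+
+\begin{lstlisting}
+class Base(object):
+    def add(self, other):
+        raise NotImplementedError("abstract base")
+
+class BoxedInteger(Base):
+    def __init__(self, intval):
+        self.intval = intval
+    def add(self, other):
+        # first dispatch: self is known to be a BoxedInteger here
+        return other.add__int(self.intval)
+    def add__int(self, intval):
+        # second dispatch: both types are known; a new box is allocated
+        return BoxedInteger(intval + self.intval)
+    def add__float(self, floatval):
+        return BoxedFloat(floatval + float(self.intval))
+
+class BoxedFloat(Base):
+    def __init__(self, floatval):
+        self.floatval = floatval
+\end{lstlisting}
+
+Each generic addition thus costs two method calls and one allocation,
+independently of the actual computation performed.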
 
 A recently popular approach to implementing just-in-time compilers for dynamic
 languages is that of a tracing JIT. A tracing JIT works by observing the running
-program and recording its hot spots into linear execution traces. Working on
-traces is the central idea of a tracing JIT. Those traces are optimized and
-turned into machine code.
+program and recording its hot spots into \emph{linear execution traces}. Those
+traces are optimized and turned into machine code.
 
 One reason for the popularity of tracing JITs is their relative
-simplicity. They can often be added to an interpreter and a lot of the
-infrastructure of the interpreter can be reused. They give some important
+simplicity. They can often be added to an existing interpreter, reusing a lot of
+the interpreter's infrastructure. They give some important
 optimizations like inlining and constant-folding for free. A tracing JIT always
-produces linear pieces of code, which simplifies many algorithms that are usually
-hard in a compiler, such as register allocation.
+produces linear pieces of code, which simplifies many of the hard algorithms in
+a compiler, such as register allocation.
 
-The usage of a tracing JIT can remove the overhead of bytecode dispatch and that
+The use of a tracing JIT can remove the overhead of bytecode dispatch and that
 of the interpreter data structures. In this paper we want to present a new
 optimization that can be added to a tracing JIT that further removes some of the
 overhead more closely associated with dynamic languages, such as boxing overhead
@@ -190,14 +189,15 @@
 informally described in Section~\ref{sec:statics}; a more formal description is
 given in Section~\ref{sec:formal}. The introduced
 techniques are evaluated in Section~\ref{sec:Evaluation} using PyPy's Python
-interpreter as a case study.
+interpreter.
 
-The contributions of this paper are:
+The contributions made by this paper are:
 
 \begin{enumerate}
-    \item An efficient and effective algorithm for removing object allocations in a tracing JIT.
+    \item A description of an efficient and effective algorithm for removing
+          object allocations in a tracing JIT.
     \item A characterization of this algorithm as partial evaluation.
-    \item A rigorous evaluation of this algorithm.
+    \item Performance benchmarks for this algorithm.
 \end{enumerate}
 
 
@@ -215,7 +215,7 @@
 \emph{RPython} \cite{davide_ancona_rpython:_2007}. RPython ("restricted Python")
 is a subset of Python chosen in such a way that type inference becomes
 possible. The language interpreter can then be compiled (``translated'') with
-PyPy's tools into a VM on the C level. During translation to C, many low-level
+PyPy's tools into a VM at the C level. During translation to C, many low-level
 aspects of the final VM, such as object layout, garbage collection and memory
 model, are woven into the generated code. Therefore the interpreter itself can
 remain at a relatively high level of abstraction.
@@ -234,13 +234,13 @@
 language that the interpreter is implementing. This process is mostly
 automatic; it only needs to be guided by the language implementer using a small number of
 source-code hints. Mostly-automatically generating a JIT compiler has many advantages
-over writing one manually, which is an error-prone and tedious process.
+over writing one manually, an error-prone and tedious process.
 By construction, the generated JIT has the same semantics as the interpreter.
-Many optimizations can benefit all languages implemented as an interpreter in RPython.
+Optimizations can be shared between different languages implemented with PyPy.
 
 Moreover, thanks to the internal design of the JIT generator, it is very easy
 to add new \emph{backends} for producing the actual machine code.  Examples of
-JIT backends that are implemented are the one for Intel x86 and x86-64 and an
+JIT backends that are implemented are those for Intel x86 and x86-64, and an
 experimental one for the CLI .NET Virtual Machine \cite{cuni_high_2010}.
 
 \subsection{Tracing JIT Compilers}
@@ -256,7 +256,7 @@
 and now Python (and other languages) via PyPy.
 
 The core idea of tracing JITs is to focus the optimization effort of the JIT
-compiler on the hot paths of the core loops of the program and to just use an
+compiler on the commonly executed, \ie \emph{hot}, paths of the core loops of the program and to just use an
 interpreter for the less commonly executed parts. VMs that use a tracing JIT are
 mostly mixed-mode execution environments; they contain both an interpreter and a
 JIT compiler. By default the interpreter is used to execute the program, doing
@@ -269,24 +269,23 @@
 it always ends with a jump to its own beginning. The trace also contains all
 operations that are performed in functions that were called in the loop; thus a
 tracing JIT automatically performs inlining.
-
-This trace of operations is then the basis of the generated code. The trace is
+This trace of operations subsequently forms the basis of the generated code. The trace is
 first optimized, and then turned into machine code. Both optimization
 and machine code generation are simple, because the traces are linear. This
 linearity makes many optimizations a lot more tractable, and the inlining that
 happens gives the optimizations more context to work with.
 
 Since the trace corresponds to one concrete execution of a loop,
-the code generated from it is only one possible path through it.
-To make sure that the trace is maintaining the correct semantics, it contains a
+the code generated from it is only one possible path through the loop.
+To make sure that the trace maintains the correct semantics, it contains a
 \emph{guard} at all places where the execution could have diverged from the
 path. Those guards check the assumptions under which execution can stay on the
-trace. As an example, if a loop contains an \lstinline{if} statement, the trace
+trace. As an example, if a loop contains an if-statement, the trace
 will contain the execution of one of the paths only, which is the path that was
 taken during the production of the trace. The trace will also contain a guard
-that checks that the condition of the \lstinline{if} statement is the same as
+that checks that the condition of the if-statement is the same as
 during tracing, because if
-it isn't, the rest of the trace is not valid. \cfbolz{The "if" shouldn't be bold}
+it isn't, the rest of the trace would not be valid.
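+
+For instance, a condition like \lstinline{x > 0} in the traced loop would
+show up in the trace roughly as follows (the operation names are
+illustrative):
+
+\begin{lstlisting}[mathescape,xleftmargin=20pt]
+$i_{1}$ = int_gt($i_{0}$, 0)
+guard_true($i_{1}$)
+\end{lstlisting}
+
+followed only by the operations of the branch that was taken during tracing.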
 
 When generating machine code, every guard is turned into a quick check to
 see whether the assumption still holds. When such a guard is hit during the
@@ -367,11 +366,11 @@
 \label{fig:objmodel}
 \end{figure}
 
-Using these classes to implement arithmetic shows the basic problem that a
-dynamic language implementation has. All the numbers are instances of either
+Using these classes to implement arithmetic shows the basic problem of a
+dynamic language implementation. All the numbers are instances of either
 \lstinline{BoxedInteger} or \lstinline{BoxedFloat}, therefore they consume space on the
 heap. Performing many arithmetic operations produces lots of garbage quickly,
-which puts pressure on the garbage collector. Using double dispatching to
+putting pressure on the garbage collector. Using double dispatching to
 implement the numeric tower needs two method calls per arithmetic operation,
 which is costly due to the method dispatch.
 
@@ -384,7 +383,7 @@
 calls inside the loop, one for each \lstinline{is_positive} and even two for each
 call to \lstinline{add}. These method calls need to check the type of the involved
 objects repeatedly and redundantly. In addition, a lot of objects are created
-when executing that loop, many of these objects do not survive for very long.
+when executing that loop, and many of these objects are short-lived.
 The actual computation that is performed by \lstinline{f} is simply a sequence of
 float or integer additions.
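+
+The loop body in question might look roughly as follows (a hypothetical
+sketch, reconstructed from the operations visible in the trace; the actual
+definition of \lstinline{f} appears earlier in the paper):
+
+\begin{lstlisting}
+def f(y):
+    res = BoxedInteger(0)
+    while y.is_positive():
+        # every iteration allocates several short-lived boxes
+        res = res.add(y).add(BoxedInteger(-100))
+        y = y.add(BoxedInteger(-1))
+    return res
+\end{lstlisting}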
 
@@ -589,7 +588,7 @@
 the type check the guard does is statically known.
 
 In the example from last section, the following operations in the upper half
-of Fig.~\ref{fig:unopt-trace} produce two
+of Figure~\ref{fig:unopt-trace} produce two
 static objects, and can be completely removed from the optimized trace:
 
 \begin{lstlisting}[mathescape,xleftmargin=20pt]
@@ -605,7 +604,7 @@
 one associated with $p_{6}$ would know that it is a \lstinline{BoxedInteger}
 whose \lstinline{intval} field contains the constant -100.
 
-The subsequent operations in Fig.~\ref{fig:unopt-trace},
+The subsequent operations in Figure~\ref{fig:unopt-trace},
  which use $p_{5}$ and $p_{6}$, could then be
 optimized using that knowledge:
 
@@ -628,7 +627,7 @@
 $i_{9}$ = int_add($i_{4}$, -100)
 \end{lstlisting}
 
-The rest of the trace from Fig.~\ref{fig:unopt-trace} is optimized similarly.
+The rest of the trace from Figure~\ref{fig:unopt-trace} is optimized similarly.
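+
+For example, a type check on one of the static objects, written in the trace
+notation used above (the operation name is illustrative):
+
+\begin{lstlisting}[mathescape,xleftmargin=20pt]
+guard_class($p_{5}$, BoxedInteger)
+\end{lstlisting}
+
+can simply be dropped, because the static object associated with $p_{5}$
+already records that it is a \lstinline{BoxedInteger}.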
 
 So far we have only described what happens when static objects are used in guards and in
 operations that read and write fields. When the static
@@ -640,7 +639,7 @@
 necessary to put operations into the residual code that allocate the
 static object at runtime.
 
-This is what happens at the end of the trace in Fig.~\ref{fig:unopt-trace}, when the \lstinline{jump} operation
+This is what happens at the end of the trace in Figure~\ref{fig:unopt-trace}, when the \lstinline{jump} operation
 is hit. The arguments of the jump are at this point static objects. Before the
 jump is emitted, they are \emph{lifted}. This means that the optimizer produces code
 that allocates a new object of the right type and sets its fields to the field
@@ -897,7 +896,7 @@
 \end{lstlisting}
 
 In this case, the static heap afterwards would be
-$\{v^* \mapsto (T_1, w^*, v^*)\}$.
+$$\{v^* \mapsto (T_1, w^*, v^*)\}.$$
 