[pypy-commit] extradoc extradoc: merge

hakanardo noreply at buildbot.pypy.org
Mon Aug 13 09:35:52 CEST 2012


Author: Hakan Ardo <hakan at debian.org>
Branch: extradoc
Changeset: r4534:00af94f610b2
Date: 2012-08-13 09:35 +0200
http://bitbucket.org/pypy/extradoc/changeset/00af94f610b2/

Log:	merge

diff --git a/talk/dls2012/licm.pdf b/talk/dls2012/licm.pdf
index 0bfb4121074fae4028d49aea25f9c0e2fa42dd53..d0e3ca21bc58e605bbf333d46f6acdc18de2a29d
GIT binary patch

[cut]

diff --git a/talk/dls2012/paper.tex b/talk/dls2012/paper.tex
--- a/talk/dls2012/paper.tex
+++ b/talk/dls2012/paper.tex
@@ -101,9 +101,9 @@
 
 \begin{document}
 
-\conferenceinfo{IWTC '11}{XXX} 
-\copyrightyear{2011} 
-\copyrightdata{[to be supplied]} 
+\conferenceinfo{DLS'12,} {October 22, 2012, Tucson, Arizona, USA.}
+\CopyrightYear{2012}
+\copyrightdata{978-1-4503-1564-7/12/10}
 
 \titlebanner{draft}        % These are ignored unless
 %\preprintfooter{short description of paper}   % 'preprint' option specified.
@@ -129,9 +129,12 @@
 motion, which is a very important optimization for code with tight kernels,
 especially for dynamic languages that typically perform quite a lot of loop invariant
 type checking, boxed value unwrapping and virtual method lookups.
-In this paper we present a scheme for making simple optimizations loop-aware by
+In this paper we explain a scheme invented within the context of the LuaJIT project
+for making simple optimizations loop-aware by
 using a simple pre-processing step on the trace and not changing the
-optimizations themselves. The scheme can give performance improvements of a
+optimizations themselves.
+We have implemented the scheme in PyPy's tracing JIT compiler,
+where it can give performance improvements of over a
 factor of two for PyPy's Python JIT executing simple numerical kernels,
 bringing the performance close to that of compiled C code.
 \end{abstract}
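
A rough sketch of the pre-processing step the abstract refers to, modelled on a toy list-of-operations trace. The tuple IR and all names below are hypothetical, not PyPy's actual data structures:

    # Toy model of loop peeling: prefix the trace with one copy of
    # itself, then rename the loop proper so its inputs are the values
    # the peeled iteration jumps with.
    def peel(ops, loop_args, jump_args):
        peeled = list(ops)                        # the trace as recorded
        rename = dict(zip(loop_args, jump_args))  # bind loop inputs
        fresh = {}
        loop = []
        for res, opname, args in ops:
            new_args = tuple(fresh.get(a, rename.get(a, a)) for a in args)
            fresh[res] = res + "'"
            loop.append((fresh[res], opname, new_args))
        return peeled + loop

    # In the paper's example the jump passes i0 unchanged, so the
    # second copy of the addition has identical arguments and an
    # unchanged forward-pass CSE can remove it:
    trace = [("i1", "int_add", ("i0", "1"))]
    print(peel(trace, loop_args=("i0",), jump_args=("i0",)))
    # [('i1', 'int_add', ('i0', '1')), ("i1'", 'int_add', ('i0', '1'))]
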
@@ -152,7 +155,7 @@
 significant amount of the execution time might be spent on such tasks
 instead of the actual computations. Moreover, the type checking,
 unwrapping and method lookups are often loop invariant and performance could be increased
-by moving those operations out of the loop. We propose a simple scheme
+by moving those operations out of the loop. We explain a simple scheme
 to make a tracing JIT loop-aware by allowing its existing optimizations to
 perform loop invariant code motion. 
 
@@ -176,11 +179,16 @@
 Having to deal with this property of traces complicates the optimization passes,
 as a more global view of a trace needs to be considered when optimizing.
 
-In this paper we want to address this problem by proposing a scheme that
-makes it possible to turn optimizations using one forward pass into
-optimizations that can do loop invariant code motion and similar loop-aware
-improvements. Using this scheme one does not need to change the underlying
-optimization much to get these advantages.
+Mike Pall pioneered a solution to address this problem in the context of a
+dynamic language using a tracing JIT compiler. He published his algorithm and
+its rationale in 2009~\cite{pall_luajit_2009} and implemented it in LuaJIT
+2.0\footnote{\texttt{http://luajit.org/}}, an open source JIT compiler for the Lua
+language. His approach makes it possible to reuse all forward pass
+optimizations to achieve loop invariant code motion and other loop-related
+optimizations, which greatly simplifies the implementation. Using this scheme
+one does not need to change the underlying optimization much to get these
+advantages. We have implemented the same approach in PyPy's tracing JIT
+compiler, the results of which we present here.
 
 The resulting optimizations one gets using this scheme are in no way novel; most
 of them are well-known loop optimizations. However, the way to implement them is
@@ -248,9 +256,9 @@
 new value of $i_0$ is $i_0$, making it a loop-invariant.
 
 Because $i_0$ is loop-invariant, the addition could be moved out of the loop.
-However, we want to get this effect using our existing optimization passes
+However, it is desirable to get this effect using our existing optimization passes
 without changing them too much. Optimizations with one forward pass
-cannot directly get this effect: They just look at the trace without taking
+cannot directly achieve this effect: They just look at the trace without taking
 into account that the trace executes many times in a row. Therefore to achieve
 loop-invariant code motion, we peel one iteration off the loop before running
 the optimizations. This peeling gives the following trace:
@@ -313,7 +321,7 @@
 arguments are inserted into the label of the loop itself and the jumps
 afterwards.
 
-This is the key insight of the proposed implementation scheme: If an
+This is the key insight of the implementation scheme: If an
 optimization is given two iterations together at the same time, the
 optimization has enough context to remove operations from the peeled loop,
 because it detects
@@ -476,7 +484,7 @@
 it is optimized to achieve better performance.
 One goal of that is to move 
 operations out of the loop so that they are executed only once
+rather than in every iteration. This can be achieved by loop peeling. It
+and not every iteration. This can be achieved by loop peeling. It
 leaves the loop body intact, but prefixes it with one iteration of the
 loop. This operation by itself will not achieve anything. But if it is
 combined with other optimizations it can increase the effectiveness of
@@ -612,7 +620,7 @@
             set($p_{9}$, intval, $i_{8}$)
 jump($L_1$, $p_{0}$, $p_{9}$)
 \end{lstlisting}
-\caption{A peeled trace of the Example Interpreter}
+\caption{A peeled trace of the example interpreter}
 \label{fig:peeled-trace}
 \end{figure}
 
@@ -911,13 +919,6 @@
 }
 
 \revd{
-The benchmark results appear quite impressive -- especially the comparison with
-GCC -- but without additional information, I have no idea what is being
-compared.  Are these results from the same sizes of integers and/or floating
-point results?
-}
-
-\revd{
 This paper is relatively short, and could be significantly improved with a
 couple of pages of additional information about the details of the benchmarks
 -- both on the Python and on the C side.
@@ -1051,7 +1052,8 @@
 a straightforward implementation providing two-dimensional
 indexing with out-of-bounds checks. For the C implementations it is
 implemented as a C++ class. The other benchmarks are implemented in
-plain C. 
+plain C. All the benchmarks except sqrt operate on C double-precision floating
+point numbers, both in the Python and the C code.
 
 Benchmarks were run on Intel i7 M620 @2.67GHz with 4M cache and 8G of RAM
 using Ubuntu Linux 11.4 in 32-bit mode.
@@ -1065,7 +1067,7 @@
 \item GCC 4.4.5 shipped with Ubuntu 11.4
 \end{itemize}
 
-We run GCC both with -O2 optimization and -O3 -march=native, disabling the
+We run GCC with -O3 -march=native, disabling the
 automatic loop vectorization. In all cases, SSE2 instructions were used for
 floating point operations, except for Psyco, which uses x87 FPU instructions.
 We also run PyPy with loop peeling optimization and without (but otherwise
@@ -1084,7 +1086,7 @@
 work~\cite{bolz_allocation_2011, bolz_runtime_2011}. The geometric mean of the
 speedup of loop peeling is 70\%, which makes benchmark times
 comparable with native-compiled C code. We attribute the performance gap relative to C code to
-the relative immaturity of RPython's JIT assembler backend as well as missing
+the relative immaturity of RPython's JIT machine code backend as well as missing
 optimizations, like instruction scheduling.
 
 Other interesting interpreters that are helped greatly by this optimization are
@@ -1098,29 +1100,27 @@
 \section{Related Work}
 \label{sec:related}
 
-Loop invariant code motion optimizations are completely
-standard~\cite{muchnick_advanced_1997}. Therefore, the effects that our
-optimization achieves are not in any way new. However, we think that achieving
-them in the way described in this paper is simpler than writing explicit algorithms.
+Loop invariant code motion optimizations are a well-known approach to optimizing
+loops~\cite{muchnick_advanced_1997}. Therefore, the effects that the
+optimizations described here achieve are not in any way new. However, we think
+that achieving them in the way described in this paper is simpler than writing
+explicit algorithms.
+\cfbolz{more explicit listing of prior work goes here}
 
-\revc{
-The discussion of LuaJIT is unsatisfying.  It's not clear to me from that one
-quote that Mike is doing the same thing.  It might be worth including LuaJIT in
-the benchmarks, and/or examining the actual implementation of LuaJIT.
-}
-\cfbolz{maybe we can look in the new LuaJIT wiki.
-how annoying would it be to rerun the benchmarks, if I can find somebody to write them?}
-\hakan{there is iwtc11/benchmarks/runall.sh which is supposed to run them all}
+As described in the introduction,
+Mike Pall pioneered the approach described in this paper.
+He showed that, unlike traditional loop-invariant code motion
+(LICM), this approach is effective, even in the presence of many
+guards and global control dependencies, which are caused by the
+semantics of dynamic languages.
 
-Mike Pall, the author of LuaJIT\footnote{\texttt{http://luajit.org/}} seems to
-have developed the described technique independently. There are no papers about
-LuaJIT but the author of it writes on a mailing list: ``The LOOP pass does
-synthetic unrolling of the recorded IR, combining copy-substitution with
-redundancy elimination to achieve code hoisting. The unrolled and
-copy-substituted instructions are simply fed back into the compiler pipeline,
-which allows reuse of all optimizations for redundancy elimination. Loop
-recurrences are detected on-the-fly and a minimized set of PHIs is
-generated.''~\cite{pall_luajit_2009}
+He writes on the Lua-users mailing list:
+``The LOOP pass does synthetic unrolling of the recorded IR, combining
+copy-substitution with redundancy elimination to achieve code hoisting. The
+unrolled and copy-substituted instructions are simply fed back into the
+compiler pipeline, which allows reuse of all optimizations for redundancy
+elimination. Loop recurrences are detected on-the-fly and a minimized set of
+PHIs is generated.''~\cite{pall_luajit_2009}
 
 Both the Hotpath VM~\cite{gal_hotpathvm:_2006} and
 SPUR~\cite{bebenita_spur:_2010} implement loop-invariant code motion
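
A sketch of how the quoted "feed the unrolled copy back into the compiler pipeline" idea composes with an ordinary redundancy-elimination pass. The op format and the assumption that all shown operations are pure are ours, not LuaJIT's:

    # Minimal forward-pass CSE over (result, opname, args) tuples,
    # assuming all operations shown are pure.  Run over the peeled
    # iteration plus the loop, it drops the loop's copy of every
    # loop-invariant operation, i.e. it performs code hoisting.
    def cse(ops):
        seen = {}      # (opname, args) -> earlier result
        replace = {}   # removed result -> surviving result
        out = []
        for res, opname, args in ops:
            args = tuple(replace.get(a, a) for a in args)
            if (opname, args) in seen:
                replace[res] = seen[(opname, args)]
            else:
                seen[(opname, args)] = res
                out.append((res, opname, args))
        return out
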
@@ -1142,9 +1142,9 @@
 \section{Conclusions}
 
 In this paper we have studied loop invariant code motion during trace
-compilation. We claim that loop peeling is a very convenient solution
-here since it fits well with other trace optimizations and does not require
-large changes to them. This approach improves the effect of standard
+compilation. We claim that the loop peeling approach of LuaJIT is a very convenient solution
+since it fits well with other trace optimizations and does not require
+large changes to them. The approach improves the effect of standard
 optimizations such as redundant guard removal, common subexpression elimination
 and allocation removal. The most prominent effect is that they all become loop
 invariant code motion optimizations.
@@ -1167,7 +1167,9 @@
 
 \acks
 We would like to thank Samuele Pedroni, Sven Hager and the anonymous reviewers
-for helpful comments on drafts of this paper.
+for helpful comments on drafts of this paper. We owe deep gratitude to Mike Pall
+for making his impressive work on LuaJIT available and for detailed help on a
+draft of the paper.
 
 % We recommend abbrvnat bibliography style.
 
diff --git a/talk/vmil2012/paper.tex b/talk/vmil2012/paper.tex
--- a/talk/vmil2012/paper.tex
+++ b/talk/vmil2012/paper.tex
@@ -44,7 +44,7 @@
   urlcolor=black,%
   citecolor=black,%
   linkcolor=black,%
-  pdftitle={Efficiently Handling Guards in the Low Level Design of RPython's Tracing JIT},%
+  pdftitle={The Efficient Handling of Guards in the Design of RPython's Tracing JIT},%
   pdfauthor={David Schneider},
 }
 
@@ -86,7 +86,7 @@
 
 \begin{document}
 
-\title{Efficiently Handling Guards in the Low Level Design of RPython's Tracing JIT}
+\title{The Efficient Handling of Guards in the Design of RPython's Tracing JIT}
 
 \authorinfo{David Schneider$^{a}$ \and Carl Friedrich Bolz$^a$}
            {$^a$Heinrich-Heine-Universität Düsseldorf, STUPS Group, Germany
@@ -121,24 +121,32 @@
 %___________________________________________________________________________
 \section{Introduction}
 
+\todo{the introduction needs some work}
+\cfbolz{the first two paragraphs talk about deoptimization, then it
+switches to guards. I would say we should only talk about guards in the
+beginning}
 In this paper we describe and analyze how deoptimization works in the context
 of tracing just-in-time compilers: what instructions are used in the
 intermediate and low-level representation of the JIT and how these
 are implemented.
 
+\cfbolz{I would kill this paragraph}
 Although there are several publications about tracing just-in-time compilers,
 to our knowledge, there are none that describe deoptimization and the use and
 implementation of guards in this context.
 
+The goal of this paper is to understand the design constraints when
+implementing guards. Guards have a runtime cost: they take time to execute. On
+the other hand, guards are possible deoptimization points. They need to store
+enough information to rebuild the interpreter state.
 Based on the informal observation that guards are among the most common
-operations in the traces produced by RPython's tracing JIT and that guards are
-operations that are associated with an overhead to maintain information about
-the execution state to be able to rebuild it in case of deoptimization, our
+operations in the traces produced by RPython's tracing JIT, our
 goal is to present concrete numbers for the frequency and the overhead related
 to guards, explain how they are implemented in the different levels of RPython's
 tracing JIT and explain the rationale behind the design decisions based on the
 numbers provided here.
 
+\cfbolz{this paragraph now suddenly \emph{introduces} guards, despite having talked about them already}
 The operations executed by an interpreter are recorded by the tracing JIT in
 case they are frequently executed; this process is described in more detail in
 Section~\ref{sec:Resume Data}. During the recording phase special operations,
@@ -152,8 +160,8 @@
 in the design and optimization of guards, the first aspect is that due to the
 large number of guards the memory overhead related to storing the information
 needed for deoptimization should be kept low. A second aspect is that
-successfully checking guards, i.e. not leaving the compiled trace,  - which is
-the common case - should be a cheap operation to execute favouring the on-trace
+successfully checking guards, i.e. not leaving the compiled trace – which is
+the common case – should be a cheap operation to execute favouring the on-trace
 execution speed in contrast to the deoptimization case where the state has to
 be rebuilt using the stored information. These constraints and trade-offs are
 what make the design and optimization of guards an important and non-trivial
@@ -164,15 +172,16 @@
 %stored at the different levels for the guards
 In this paper we want to substantiate the aforementioned observations and
 describe based on them the reasoning behind and the implementation of guards in
-RPython's tracing just-in-time compiler, the contributions of this paper are:
+RPython's tracing just-in-time compiler. The contributions of this paper are:
 \begin{itemize}
   \item an analysis of guards in the context of RPython's tracing JIT to
-  substantiate the aforementioned observation, based on a set of benchmarks.
-  \item We provide a detailed measurements about the frequency and the
-  overhead associated with guards.
-  \item We provide a description about how guards are implemented in the high\-
+  substantiate the aforementioned observation, based on a set of benchmarks,
+  \item detailed measurements about the frequency and the
+  overhead associated with guards, and
+  \item a description of how guards are implemented in the high-
  and low-level parts of the JIT and of the rationale behind the design.
 \end{itemize}
+
 \begin{figure}
     \include{figures/guard_table}
     \caption{Percentage of guards before and after optimization for different benchmarks}
@@ -203,7 +212,7 @@
 \label{sub:pypy}
 
 
-The RPython language and the PyPy Project were started in 2002 with the goal of
+The RPython language and the PyPy project were started in 2002 with the goal of
 creating a Python interpreter written in a high level language, allowing easy
 language experimentation and extension. PyPy is now a fully compatible
 alternative implementation of the Python language\bivab{mention speed}. The
@@ -218,7 +227,7 @@
 RPython is built of two components, the language and the translation toolchain
 used to transform RPython programs to executable units.  The RPython language
 is a statically typed object oriented high level language. The language provides
-several features such as automatic memory management (aka. Garbage Collection)
+several features such as automatic memory management
 and just-in-time compilation. When writing an interpreter using RPython the
 programmer only has to write the interpreter for the language she is
 implementing.  The second RPython component, the translation toolchain, is used
@@ -235,9 +244,13 @@
 observing the execution of a program. VMs using tracing JITs are typically
 mixed-mode execution environments that also contain an interpreter. The
 interpreter profiles the executed program and selects frequently executed code
-paths to be compiled to machine code. After profiling identified an interesting
+paths to be compiled to machine code. Many tracing JIT compilers focus on
+selecting hot loops.
+
+After profiling has identified an interesting
 path, tracing is started, recording all operations that are executed on this
-path. Like in most compilers tracing JITs use an intermediate representation to
+path. This includes inlining function calls.
+Like most compilers, tracing JITs use an intermediate representation to
 store the recorded operations, which is typically in SSA
 form~\cite{cytron_efficiently_1991}. Since tracing follows actual execution the
 code that is recorded
@@ -245,6 +258,9 @@
 divergence from the recorded path are marked with special operations called
 \emph{guards}; these operations ensure that assumptions valid during the
 tracing phase are still valid when the code has been compiled and is executed.
+In the case of dynamic languages, guards are also used to encode type checks
+that come from optimistic type specialization by recording the types of
+variables seen during tracing.
 After a trace has been recorded it is optimized and then compiled to platform
 specific machine code.
 
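
A toy illustration of the optimistic type specialization just described; the tracer API and operation names are invented for this sketch and do not mirror RPython's real interfaces:

    class ToyTracer:
        def __init__(self):
            self.trace = []
        def emit(self, opname, *args):
            self.trace.append((opname,) + args)

    def record_add(tracer, a, b):
        # tracing follows one concrete execution; both operands
        # happened to be ints here, so int-specialized operations are
        # recorded, protected by guards re-checking that assumption
        if type(a) is int and type(b) is int:
            tracer.emit("guard_class", "p0", "int")
            tracer.emit("guard_class", "p1", "int")
            tracer.emit("int_add", "i0", "i1")
            return a + b
        raise NotImplementedError("other cases traced analogously")

    t = ToyTracer()
    record_add(t, 4, 5)   # t.trace now holds the guarded fragment
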
@@ -290,7 +306,10 @@
 Since tracing linearizes control flow by following one concrete execution,
 the full control flow of a program is not observed.
 The possible points of deviation from the trace are guard operations
-that check whether the same assumptions observed during tracing still hold during execution.
+that check whether the same assumptions observed during tracing
+still hold during execution.
+Similarly, in the case of dynamic languages guards can also encode type
+assumptions.
 In later executions of the trace the guards can fail.
 If that happens, execution needs to continue in the interpreter.
 This means it is necessary to attach enough information to a guard
@@ -335,13 +354,20 @@
 \subsection{Compression of Resume Data}
 \label{sub:compression}
 
+After tracing has finished, the trace is optimized.
+During optimization a large percentage of operations can be removed.
+In the process the resume data is transformed into its final, compressed form.
+The rationale for not compressing the resume data during tracing
+is that a lot of guards will be optimized away.
+For them, the compression effort would be lost.
+
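A sketch of that rationale, with invented names; the point is only that compression work is spent exclusively on guards that survive the optimizer:

    from dataclasses import dataclass

    @dataclass
    class GuardOp:
        name: str
        raw_resume_data: object = None  # captured cheaply while tracing
        resume_data: object = None      # compressed form, filled in late

    def compress_surviving_guards(optimized_trace, compress):
        for op in optimized_trace:
            if op.name.startswith("guard_"):
                op.resume_data = compress(op.raw_resume_data)
                op.raw_resume_data = None  # drop the bulky raw form
        # guards removed during optimization were never compressed
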
 The core idea of storing resume data as compactly as possible
 is to share parts of the data structure between subsequent guards.
 This is often useful because the density of guards in traces is so high
 that quite often not much changes between them.
 Since resume data is a linked list of symbolic frames
 often only the information in the top frame changes from one guard to the next.
-The other frames can often be just reused.
+The other symbolic frames can often just be reused.
 The reason for this is that during tracing only the variables
 of the currently executing frame can change.
 Therefore if two guards are generated from code in the same function
@@ -393,7 +419,7 @@
 is RPython's allocation removal optimization~\cite{bolz_allocation_2011}.
 This optimization discovers allocations in the trace that create objects
 that do not survive long.
-An example is the instance of \lstinline{Even} in the example\cfbolz{reference figure}.
+An example is the instance of \lstinline{Even} in Figure~\ref{fig:unopt-trace}.
 Allocation removal makes resume data more complex.
 Since allocations are removed from the trace it becomes necessary
 to reconstruct the objects that were not allocated so far when a guard fails.
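
A hypothetical sketch of that reconstruction step. The resume-data layout is invented; only the idea that removed allocations are performed lazily when a guard fails comes from the text:

    def rebuild_virtuals(virtual_descrs, read_location):
        # virtual_descrs: {virtual_id: (cls, {field: location})} where a
        # location names a constant, register or stack slot at the guard
        rebuilt = {}
        for vid, (cls, fields) in virtual_descrs.items():
            obj = cls.__new__(cls)          # the allocation happens late
            for field, loc in fields.items():
                setattr(obj, field, read_location(loc))
            rebuilt[vid] = obj
        return rebuilt
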
@@ -435,7 +461,7 @@
 
 Figure~\ref{fig:trace-log} shows the optimized version of the trace in
 Figure~\ref{fig:unopt-trace}. Allocation removal has removed the
-\lstinline{new} operation and other operations handling the boxes. The
+\lstinline{new} operation and other operations handling the instance. The
 operations handle unboxed numbers now.
 
 Figure~\ref{fig:resume-data} sketches the symbolic frames of the first two
@@ -466,7 +492,7 @@
 After optimization the resulting trace is handed over to the platform specific
 backend to be compiled to machine code. The compilation phase consists of two
 passes over the lists of instructions, a backwards pass to calculate live
-ranges of IR-level variables and a forward one to emit the instructions. During
+ranges of IR-level variables and a forward pass to emit the instructions. During
 the forward pass IR-level variables are assigned to registers and stack
 locations by the register allocator according to the requirements of the
 instructions to be emitted.  Eviction/spilling is performed based on the live range
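
A sketch of the backwards liveness pass mentioned in this hunk, over a hypothetical (result, opname, args) instruction list; a variable's last use is simply its first occurrence when scanning in reverse:

    def last_uses(ops):
        """Map each variable to the index of its last use, so the
        forward pass knows when its register can be reused."""
        last = {}
        for i in range(len(ops) - 1, -1, -1):
            _res, _opname, args = ops[i]
            for a in args:
                last.setdefault(a, i)  # first hit in reverse = last use
        return last
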
@@ -476,7 +502,7 @@
 emitted. Guard instructions are transformed into fast checks at the machine
 code level that verify the corresponding condition.  In cases where the value being
 checked by the guard is not used anywhere else, the guard and the operation
-producing the value can merged, reducing even more the overhead of the guard.
+producing the value can often be merged, further reducing the overhead of the guard.
 Figure \ref{fig:trace-compiled} shows how an \texttt{int\_eq} operation
 followed by a guard that checks the result of the operation are compiled to
 pseudo-assembler if the operation and the guard are compiled separately or if
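
In the spirit of that figure's pseudo-assembler (registers and mnemonics invented here), the merged form consumes the comparison's flags directly and never materializes the boolean:

    def compile_int_eq_guard(merged):
        if merged:
            return ["cmp  r1, r2",
                    "jne  <trampoline>"]   # fail if operands differ
        return ["cmp  r1, r2",
                "sete r0",                 # materialize the boolean
                "cmp  r0, 0",
                "je   <trampoline>"]       # guard tests it again
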
@@ -523,10 +549,10 @@
 information provided by the register allocator about where the values
 corresponding to each IR-variable required by the guard will be stored when
 execution reaches the code emitted for the corresponding guard. This data
-structure stores the data in a compressed manner using an encoding the uses
+structure stores the data in a compressed manner using an encoding that uses
 8 bits to store 7 bits of information. This encoding is efficient to create and
-provides a compact representation of the needed information. This encoding
-needs to be as compact as possible to maintain an acceptable memory profile.
+provides a compact representation of the needed information, which helps
+to maintain an acceptable memory profile.
 
 Second, for each guard a piece of code is generated that acts as a trampoline.
 Guards are implemented as a conditional jump to this trampoline. In case the
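
The excerpt only says the encoding stores 7 bits of payload per byte; one common scheme matching that description (not necessarily the one actually used) keeps the eighth bit as a continuation flag:

    def encode_uint(value):
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)  # high bit: more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    def decode_uint(data):
        value = shift = 0
        for byte in data:
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        return value

    assert decode_uint(encode_uint(300)) == 300
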
@@ -555,7 +581,7 @@
 a new trace, referred to as a \emph{bridge}, starting from this guard is recorded and
 compiled. When compiling bridges the goal is that future failures of the guards
 that led to the compilation of the bridge should execute the bridge without
-additional overhead, in particular the failure of the guard should not lead
+additional overhead. In particular, the failure of the guard should not lead
 to leaving the compiled code prior to executing the code of the bridge.
 
 The process of compiling a bridge is very similar to compiling a loop.
@@ -567,7 +593,8 @@
 representation created for the guard to rebuild the bindings from IR-variables
 to stack locations and registers used in the register allocator.  With this
 reconstruction all bindings are restored to the state as they were in the
-original loop up to the guard.
+original loop up to the guard. This means that no register/stack reshuffling is
+needed before executing a bridge.
 
 Once the bridge has been compiled the guard that led to compiling the bridge is
 patched to redirect control flow to the bridge in case the check fails. In
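
A sketch, with an invented allocator interface, of why no reshuffling is needed: the bridge's register allocator is simply seeded with the locations recorded for the guard:

    def seed_allocator_from_guard(allocator, guard_bindings):
        # guard_bindings: {ir_variable: location}, decoded from the
        # guard's compact representation; locations are e.g. "r3" or
        # ("stack", 2).  Seeding the allocator with them means the
        # bridge's first instructions read values where the loop left
        # them, with no move instructions in between.
        for var, loc in guard_bindings.items():
            allocator.force_binding(var, loc)
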
@@ -594,6 +621,7 @@
 \section{Evaluation}
 \label{sec:evaluation}
 \todo{improve the table formatting}
+\todo{give a reference to the benchmark scripts to make things repeatable}
 
 The results presented in this section are based on numbers gathered by running
 a subset of the standard PyPy benchmarks. The PyPy benchmarks are used to
@@ -701,7 +729,7 @@
 about 15\% to 20\% of the amount of memory compared to the size of the
 generated machine code. On the other hand the generated machine code has only a
 size ranging from 20.5\% to 37.98\% of the size of the high and low-level
-\texttt{resume data} combined and being compressed as described before.
+resume data combined, compressed as described before.
 
 Tracing JIT compilers only compile the subset of the code executed in a program
 that is traced in a hot loop; for this reason the amount of generated machine
@@ -859,7 +887,7 @@
 and their fields filled with the values
 described by the deoptimization information.
 The paper does not describe any attempts to store this information compactly.
-This may not be needed in their approach, because method-based JITs have a lot
+This may not be needed in their approach, because method-based JITs have
 fewer deoptimization points than tracing JITs.
 
 

