cfbolz at codespeak.net
Wed Dec 17 11:46:20 CET 2008

Author: cfbolz
Date: Wed Dec 17 11:46:18 2008
New Revision: 60532

Removed:
Modified:
Log:
steal more text

==============================================================================
+++ pypy/extradoc/talk/ecoop2009/benchmarks.tex	Wed Dec 17 11:46:18 2008
@@ -2,12 +2,14 @@

\anto{maybe we should move this section somewhere else, if we want to use TLC
as a running example in other sections}
+introduction?}

In this section, we will briefly describe \emph{TLC}, a simple dynamic
language that we developed to exercise our JIT compiler generator.  As most of
dynamic languages around, \emph{TLC} is implemented through a virtual machine
that interprets a custom bytecode. Since our main interest is in the runtime
-performances of the VM, we did not implement the parser nor the bytecode
+performance of the VM, we did not implement the parser nor the bytecode
compiler, but only the VM itself.

TLC provides four different types:
@@ -18,8 +20,8 @@
\item Lisp-like lists
\end{enumerate}

-Objects represent a collection of named attributes (much like Javascript or
-SELF) and named methods.  At creation time, it is necessary to specify the set
+Objects represent a collection of named attributes (much like JavaScript or
+Self) and named methods.  At creation time, it is necessary to specify the set
of attributes of the object, as well as its methods.  Once the object has been
created, it is not possible to add/remove attributes and methods.

@@ -40,15 +42,19 @@
\end{itemize}

Obviously, not all the operations are applicable to all objects. For example,
-it is not possibile to \lstinline{ADD} an integer and an object, or reading an
+it is not possible to \lstinline{ADD} an integer and an object, or to read an
attribute from an object which does not provide it.  Being a dynamic language,
the VM needs to do all these checks at runtime; in case one of the checks
fails, the execution is simply aborted.
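To make the abort-on-failure behaviour concrete, here is a hypothetical sketch of the run-time checking an \lstinline{ADD} opcode would perform (the helper names are invented for illustration; the actual TLC source is not shown here):

```python
# Hypothetical sketch of the run-time checking done by ADD in a VM like TLC
# (invented names, not the actual TLC implementation).
class IntObj:
    def __init__(self, value):
        self.value = value

def op_add(a, b):
    # A dynamic language must check operand types at run-time;
    # if a check fails, execution is simply aborted.
    if not (isinstance(a, IntObj) and isinstance(b, IntObj)):
        raise RuntimeError("ADD: operands must both be integers")
    return IntObj(a.value + b.value)
```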

\anto{should we try to invent a syntax for TLC and provide some examples?}
+\cfbolz{we should provide an example with the assembler syntax}

\section{Benchmarks}

+\cfbolz{I think this should go to the beginning of the description of the TLC as
+it explains why it is written as it is written}
+
Despite being very simple and minimalistic, \lstinline{TLC} is a good
candidate as a language to run benchmarks, as it has some of the features that
make most current dynamic languages so slow:
@@ -58,15 +64,15 @@
\item \textbf{Stack based VM}: this kind of VM requires all the operands to be
on top of the evaluation stack.  As a consequence, programs spend a lot of
time pushing and popping values to/from the stack, or doing other stack
-  related operations.  However, thanks ot its semplicity this is still the
+  related operations.  However, thanks to its simplicity this is still the
most common and preferred way to implement VMs.

\item \textbf{Boxed integers}: integer objects are internally represented as
an instance of the \lstinline{IntObj} class, whose field \lstinline{value}
-  contains the real value.  By having boxed integers, common ariithmetic
+  contains the real value.  By having boxed integers, common arithmetic
operations are made very slow, because each time we want to load/store their
value we need to go through an extra level of indirection.  Moreover, in
-  case of a complex expression, it is necessary to create a lot of temporary
+  case of a complex expression, it is necessary to create many temporary
objects to hold intermediate results.

\item \textbf{Dynamic lookup}: attributes and methods are looked up at

==============================================================================
+++ pypy/extradoc/talk/ecoop2009/intro.tex	Wed Dec 17 11:46:18 2008
@@ -10,16 +10,19 @@
it still requires a lot of work, as IronPython, Jython and JRuby demonstrate.

Moreover, writing a static compiler is often not enough to get high
-performances; IronPython and JRuby are going in the direction of JIT compiling
+performance; IronPython and JRuby are going in the direction of JIT compiling
specialized versions of the code depending on the actual values/types seen at
-runtime; this approach seems to work, but write it manually requires an
+runtime; this approach seems to work, but writing it manually requires an
enormous effort.

-PyPy's idea is to automatize the generation of static/JIT compilers in order
-to reduce at minimun the effort required to get a fast implementation of a
+\cfbolz{we should cite the dyla paper somewhere here}
+
+PyPy's idea is to automatize the generation of JIT compilers in order
+to reduce to a minimum the effort required to get a fast implementation of a
dynamic language; all you have to do is to write a high level specification of
-the language (by writing an interpreter), and put it through PyPy's
-translation toolchain.
+the language (by writing an interpreter), and put it through PyPy's
+translation toolchain. The automatic generation of JIT compilers is done with
+the help of partial evaluation techniques.

\subsection{PyPy and RPython}

@@ -49,7 +52,7 @@
Compilation of the interpreter is implemented as a stepwise
refinement by means of a translation toolchain which performs type
analysis, code optimizations and several transformations aiming at
-incrementally providing implementation details as memory management or the threading model.
+incrementally providing implementation details such as memory management or the threading model.
The different kinds of intermediate codes  which are refined
during the translation process are all represented by a collection of control flow graphs,
at several levels of abstractions.
@@ -60,15 +63,23 @@
Currently, three fully developed backends are available to produce
executable C/POSIX code, Java and CLI/.NET bytecode.

-Despite the PyPy infrastructure was specifically developed
-for Python, in fact it can be used for implementing
-other languages. Indeed, PyPy has been successfully experimented with
-several languages as Smalltalk \cite{BolzEtAl08}, JavaScript, Scheme and Prolog.
+Despite having been specifically developed for Python, the PyPy infrastructure
+can in fact be used for implementing other languages. Indeed, there were
+successful experiments of using PyPy to implement several other languages such
+as Smalltalk \cite{BolzEtAl08}, JavaScript, Scheme and Prolog.
As suggested by Figure~\ref{pypy-fig}, a portable interpreter for a
generic language $L$  can be
-easily developed once an abstract interpreter for $L$ is implemented in
+easily developed once an interpreter for $L$ has been implemented in
RPython.

+\subsection{PyPy and JIT-Generation}
+
+This section will give a high-level overview of how the JIT-generation process
+works. More details will be given in subsequent sections.
+
+
Another interesting feature of PyPy
is that just-in-time compilers can be semi-automatically generated from the
interpreter source.
+
+XXX list contributions clearly

==============================================================================
+++ pypy/extradoc/talk/ecoop2009/jitgen.tex	Wed Dec 17 11:46:18 2008
@@ -16,15 +16,46 @@
uses the same techniques but it's manually written instead of being
automatically generated.

-The original idea is by Futamura \cite{Futamura99}. He proposed to generate compilers
-from interpreters with automatic specialization, but his work has had
-relatively little practical impact so far.
-
\subsection{Partial evaluation}

-Assume the Python bytecode to be constant, and constant-propagate it into the
-Python interpreter.
-\cfbolz{note to self: steal bits from the master thesis?}
+In 1971 Yoshihiko Futamura published a paper \cite{Futamura99} that proposed a
+technique to automatically transform an interpreter of a programming language
+into a compiler for the same language. This would remove the need to write a
+compiler: writing the much simpler interpreter would suffice. He proposed to use
+partial evaluation to achieve this goal. He defined partial evaluation along the following lines:
+
+Given a program $P$ with $m + n$ input variables $s_1, ..., s_m$ and $d_1, ...,
+d_n$, the partial evaluation of $P$ with respect to concrete values $s'_1, ...,
+s'_m$ for the first $m$ variables is a program $P'$. The program $P'$ takes only
+the input variables $d_1, ..., d_n$ but behaves exactly like $P$ with the
+concrete values (but is hopefully more efficient). This transformation is done
+by a program $S$, the partial evaluator, which takes $P$ and $s_1, ..., s_m$ as
+input:
+
+    $$S(P, (s'_1, ..., s'_m)) = P'$$
+
+The variables $s_1, ..., s_m$ are called the \emph{static} variables, the
+variables $d_1, ..., d_n$ are called the \emph{dynamic} variables; $P'$ is the
+\emph{residual code}. Partial evaluation creates a version of $P$ that works
+only for a fixed set of inputs for the first $m$ arguments. This effect is
+called \emph{specialization}.
+
+When $P$ is an interpreter for a programming language, then the $s_1, ..., s_m$
+are chosen such that they represent the program that the interpreter is
+interpreting and the $d_1, ..., d_n$ represent the input of this program. Then
+$P'$ can be regarded as a compiled version of the program that the chosen $s'_1,
+..., s'_m$ represent, since it is a version of the interpreter that can only
+interpret this program. Now once the partial evaluator $S$ is implemented, it is
+actually enough to implement an interpreter for a new language and use $S$
+together with this interpreter to compile programs in that new language.
+
+A valid implementation for $S$ would be to just put the concrete values into $P$
+to get $P'$, which would not actually produce any performance benefits compared with
+directly using $P$. A good implementation for $S$ should instead make use of the
+information it has and evaluate all the parts of the program that actually
+depend only on $s_1, ..., s_m$, and remove the parts of $P$ that cannot be
+reached given the concrete values.
+
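As a concrete illustration of these definitions (an invented toy example, not anything from PyPy's tooling), consider $P$ to be a `power` function, with the exponent as the static variable $s_1$ and the base as the dynamic variable $d_1$:

```python
# Illustrative sketch only (hypothetical names): P is `power`, the exponent
# n is the static variable s_1, the base x is the dynamic variable d_1.
def power(x, n):
    result = 1
    while n > 0:
        result = result * x
        n = n - 1
    return result

# A toy partial evaluator S for this one program: it unrolls the loop over
# the static n and emits residual code P' that takes only the dynamic x.
def specialize_power(n):
    body = "    return " + " * ".join(["x"] * n) if n > 0 else "    return 1"
    src = "def power_n(x):\n" + body
    namespace = {}
    exec(src, namespace)
    return namespace["power_n"]   # the residual program P'

power_3 = specialize_power(3)     # S(P, 3) = P'
```

The loop over the static variable has disappeared entirely from the residual code, which is the specialization effect described above.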

\cfbolz{I would propose to use either TLC as an example here, or something that
looks at least like an interpreter loop}
@@ -102,8 +133,9 @@
control flow graphs of the interpreter, guided by the binding times.  We
call this process \emph{timeshifting}.

+XXX write something about the problems of classical PE?

-\subsection{Execution steps}
+\subsection{Partial Evaluation in PyPy}

PyPy contains a framework for generating just-in-time compilers using
@@ -139,16 +171,20 @@
colors that we use when displaying the control flow graphs:

\begin{itemize}
-\item \emph{Green} variables contain values that are known at compile-time;
-\item \emph{Red} variables contain values that are not known until run-time.
+\item \emph{Green} variables contain values that are known at compile-time.
+They correspond to static arguments.
+\item \emph{Red} variables contain values that are usually not known at
+compile-time. They correspond to dynamic arguments.
\end{itemize}
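To make the colors concrete, here is a hypothetical interpreter-style loop annotated with the binding times each variable would get (an illustrative sketch, not the actual hint-annotated TLC code):

```python
# Hypothetical sketch: green vs. red variables in a tiny stack-VM loop.
# The bytecode is a static argument, so everything derived only from it
# is green; run-time operands are red.
def interp(bytecode, arg):        # bytecode: green, arg: red
    pc = 0                        # green: depends only on the bytecode
    stack = [arg]                 # red: holds run-time values
    while pc < len(bytecode):     # green condition: unrolled at compile-time
        op = bytecode[pc]         # green: read out of a green structure
        pc += 1
        if op == "PUSH1":
            stack.append(1)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)   # red operation: survives in residual code
    return stack.pop()
```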

-The binding-time analyzer of our translation tool-chain is based on the
+The binding-time analyzer of our translation tool-chain is using a simple
+abstract-interpretation based analysis. It is based on the
same type inference engine that is used on the source RPython program,
the annotator.  In this mode, it is called the \emph{hint-annotator}; it
operates on the RPython-level flow graphs, and propagates annotations that
do not track types but
value dependencies and manually-provided binding time hints.
+XXX the above needs rewriting when the background section is there

The normal process of the hint-annotator is to propagate the binding
time (i.e. color) of the variables using the following kind of rules:
@@ -196,9 +232,9 @@
the hint-annotator as a request for both \texttt{v1} and \texttt{v2} to be green.  It
has a \emph{global} effect on the binding times: it means that not only
\texttt{v1} but all the values that \texttt{v1} depends on – recursively –
-are forced to be green.  The hint-annotator complains if the
+are forced to be green.  The hint-annotator gives an error if the
dependencies of \texttt{v1} include a value that cannot be green, like
-a value read out of a field of a non-immutable structure.
+a value read out of a field of a non-immutable instance.

Such a need-oriented backward propagation has advantages over the
commonly used forward propagation, in which a variable is compile-time
@@ -211,9 +247,6 @@
it prevents under-specialization: an unsatisfiable \texttt{hint(v1,
concrete=True)} is reported as an error.

-In our context, though, such an error can be corrected.  This is done by
-promoting a well-chosen variable among the ones that \texttt{v1} depends on.
-
Promotion is invoked with the use of a hint as well:
\texttt{v2 = hint(v1, promote=True)}.
This hint is a \emph{local} request for \texttt{v2} to be green, without
@@ -222,4 +255,3 @@
approaches to partial evaluation.  See the Promotion section XXX ref for a
complete discussion of promotion.

-

==============================================================================
+++ pypy/extradoc/talk/ecoop2009/main.tex	Wed Dec 17 11:46:18 2008
@@ -36,11 +36,16 @@
\newcommand\anto[1]{\nb{ANTO}{#1}}
\newcommand{\commentout}[1]{}

+\let\oldcite=\cite
+
+\renewcommand\cite[1]{\ifthenelse{\equal{#1}{XXX}}{[citation~needed]}{\oldcite{#1}}}
+
+
\begin{document}
\title{Automatic generation of JIT compilers for dynamic languages
in .NET\thanks{This work has been partially
supported by MIUR EOS DUE - Extensible Object Systems for Dynamic and
-Unpredictable Environments.}}
+Unpredictable Environments.\cfbolz{should we put the PyPy EU project here as well?}}

\author{Davide Ancona\inst{1} \and Carl Friedrich Bolz\inst{2} \and Antonio Cuni\inst{1} \and Armin Rigo}
@@ -78,7 +83,6 @@

\input{abstract}
\input{intro}
-\input{background}
\input{jitgen}
\input{rainbow}
\input{clibackend}

==============================================================================
+++ pypy/extradoc/talk/ecoop2009/rainbow.tex	Wed Dec 17 11:46:18 2008
@@ -14,33 +14,78 @@
The Rainbow bytecode is produced at translation time, when the JIT compiler is
generated.

-Here are summarized the various phases of the JIT:
+\cfbolz{XXX I think we should be very careful with the rainbow interp. it is a
+total implementation-detail and we should only describe it as little as
+possible}

-Translation time:
-
-  * Low-level flowgraphs are produced
-
-  * The *hint-annotator* colors the variables
-
-  * The *rainbow codewriter* translates flowgraphs into rainbow bytecode
-
-
-Compile-time:
-
-  * The rainbow interpreter executes the bytecode
-
-  * As a result, it produces executable code
+\subsection{Example of Rainbow bytecode and execution}

-Runtime:
+TODO

-  * The produced code is executed
+\section{Promotion}

+In the sequel, we describe in more detail one of the main new
+techniques introduced in our approach, which we call \emph{promotion}.  In
+short, it allows an arbitrary run-time value to be turned into a
+compile-time value at any point in time.  Promotion is thus the central way by
+which we make use of the fact that the JIT is running interleaved with actual
+program execution. Each promotion point is explicitly defined with a hint that
+must be put in the source code of the interpreter.
+
+From a partial evaluation point of view, promotion is the converse of
+the operation generally known as "lift" \cite{XXX}.  Lifting a value means
+copying a variable whose binding time is compile-time into a variable
+whose binding time is run-time – it corresponds to the compiler
+"forgetting" a particular value that it knew about.  By contrast,
+promotion is a way for the compiler to gain \emph{more} information about
+the run-time execution of a program. Clearly, this requires
+fine-grained feedback from run-time to compile-time, thus a
+dynamic setting.
+
+Promotion requires interleaving compile-time and run-time phases,
+otherwise the compiler can only use information that is known ahead of
+time. It is impossible in the "classical" approaches to partial
+evaluation, in which the compiler always runs fully ahead of execution.
+This is a problem in many large use cases.  For example, in an
+interpreter for a dynamic language, there is mostly no information
+that can be clearly and statically used by the compiler before any
+code has run.
+
+A very different point of view on promotion is as a generalization of
+techniques that already exist in dynamic compilers as found in modern
+object-oriented language virtual machines.  In this context feedback
+techniques are crucial for good results.  The main goal is to
+optimize and reduce the overhead of dynamic dispatching and indirect
+invocation.  This is achieved with variations on the technique of
+polymorphic inline caches \cite{XXX}: the dynamic lookups are cached and
+the corresponding generated machine code contains chains of
+compare-and-jump instructions which are modified at run-time.  These
+techniques also allow the gathering of information to direct inlining for even
+better optimization results.
+
+In the presence of promotion, dispatch optimization can usually be
+reframed as a partial evaluation task.  Indeed, if the type of the
+object being dispatched to is known at compile-time, the lookup can be
+folded, and only a (possibly inlined) direct call remains in the
+generated code.  In the case where the type of the object is not known
+at compile-time, it can first be read at run-time out of the object and
+promoted to compile-time.  As we will see in the sequel, this produces
+very similar machine code\footnote{This can also be seen as a generalization of
+a partial evaluation transformation called "The Trick" (see e.g. \cite{XXX}),
+which again produces similar code but which is only applicable for finite sets
+of values.}.
+
+The essential advantage is that it is no longer tied to the details of
+the dispatch semantics of the language being interpreted, but applies in
+more general situations.  Promotion is thus the central enabling
+primitive to make partial evaluation a practical approach to language
+independent dynamic compiler generation.

-\subsection{Example of Rainbow bytecode and execution}
+\subsection{Promotion as Applied to the TLC}

-TODO
+XXX

-\subsection{Promotion}
+\subsection{Promotion in Practice}

There are values that, if known at compile time, allow the JIT compiler to
produce very efficient code.  Unfortunately, these values are typically red,
@@ -54,7 +99,8 @@
This is done by continuously intermixing compile time and runtime; a promotion
is implemented in this way:

-  * (compile time): the rainbow interpreter produces machine code until it
+\begin{itemize}
+  \item (compile time): the rainbow interpreter produces machine code until it
hits a promotion point; e.g.:

\begin{lstlisting}[language=C]
@@ -62,7 +108,7 @@
return y+10
\end{lstlisting}

-  * (compile time): at this point, it generates special machine code that when
+  \item (compile time): at this point, it generates special machine code that when
reached calls the JIT compiler again; the JIT compilation stops:

\begin{lstlisting}[language=C]
@@ -71,11 +117,11 @@
}
\end{lstlisting}

-  * (runtime): the machine code is executed; when it reaches a promotion
+  \item (runtime): the machine code is executed; when it reaches a promotion
point, it executes the special machine code we described in the previous
point; the JIT compiler is invoked again;

-  * (compile time): now we finally know the exact value of our red variable,
+  \item (compile time): now we finally know the exact value of our red variable,
and we can promote it to green; suppose that the value of 'y' is 32:

\begin{lstlisting}[language=C]
@@ -88,10 +134,83 @@
Note that the operation "y+10" has been constant-folded into "42", as it
was a green operation.

-  * (runtime) the execution restart from the point it stopped, until a new
+  \item (runtime): the execution restarts from the point where it stopped, until a new
unhandled promotion point is reached.
+\end{itemize}
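The compile-time/run-time interleaving enumerated above can be sketched in plain Python (all names here are invented for illustration); the dictionary plays the role of the chain of compare-and-jump instructions that the real JIT patches into the generated machine code:

```python
# Hypothetical sketch of the promotion steps above (invented names).
compiled_versions = {}   # stands in for the patched compare-and-jump chain

def compile_for(y):
    const = y + 10                 # green operation: constant-folded, e.g. 42
    def residual():
        return const               # residual code just returns the constant
    return residual

def promotion_point(y):
    # On a cache miss the JIT compiler is re-entered with the now-known y;
    # on a hit, execution jumps straight into previously generated code.
    if y not in compiled_versions:
        compiled_versions[y] = compile_for(y)
    return compiled_versions[y]()
```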

-\subsection{Virtuals and virtualizables}
+\section{Automatic Unboxing of Intermediate Results}

-\cfbolz{do we even want to talk about virtualizables?}
-TODO
+XXX the following section needs a rewriting to be much more high-level and to
+compare more directly with classical escape analysis
+
+Interpreters for dynamic languages typically allocate a lot of small
+objects, for example due to boxing.  For this reason, we
+implemented a way for the compiler to generate residual memory
+allocations as lazily as possible.  The idea is to try to keep new
+run-time structures "exploded": instead of a single run-time pointer to
+a heap-allocated data structure, the structure is "virtualized" as a set
+of fresh variables, one per field.  In the compiler, the variable that
+would normally contain the pointer to the structure gets instead a
+content that is neither a run-time value nor a compile-time constant,
+but a special \emph{virtual structure} – a compile-time data structure that
+recursively contains new variables, each of which can again store a
+run-time, a compile-time, or a virtual structure value.
+
+This approach is based on the fact that the "run-time values" carried
+around by the compiler really represent run-time locations – the name of
+a CPU register or a position in the machine stack frame.  This is the
+case for both regular variables and the fields of virtual structures.
+It means that the compilation of a \texttt{getfield} or \texttt{setfield}
+operation performed on a virtual structure simply loads or stores such a
+location reference into the virtual structure; the actual value is not
+copied around at run-time.
+
+It is not always possible to keep structures virtual.  The main
+situation in which it needs to be "forced" (i.e. actually allocated at
+run-time) is when the pointer escapes to some non-virtual location like
+a field of a real heap structure.
+
+Virtual structures still avoid the run-time allocation of most
+short-lived objects, even in non-trivial situations.  The following
+example shows a typical case.  Consider the Python expression \texttt{a+b+c}.
+Assume that \texttt{a} contains an integer.  The PyPy Python interpreter
+implements application-level integers as boxes – instances of a
+\texttt{W\_IntObject} class with a single \texttt{intval} field.  Here is the
+implementation of the addition of two such boxed integers:
+
+XXX needs to use TLC examples
+\begin{verbatim}
+    def add(w1, w2):            # w1, w2 are W_IntObject instances
+        value1 = w1.intval
+        value2 = w2.intval
+        result = value1 + value2
+        return W_IntObject(result)
+\end{verbatim}
+
+When interpreting the bytecode for \texttt{a+b+c}, two calls to \texttt{add()} are
+issued; the intermediate \texttt{W\_IntObject} instance is built by the first
+call and thrown away after the second call.  By contrast, when the
+interpreter is turned into a compiler, the construction of the
+\texttt{W\_IntObject} object leads to a virtual structure whose \texttt{intval}
+field directly references the register in which the run-time addition
+put its result.  This location is read out of the virtual structure at the
+second call to \texttt{add()}, which directly operates on the same register.
+
+An interesting effect of virtual structures is that they play nicely with
+promotion.  Indeed, before the interpreter can call the proper \texttt{add()}
+function for integers, it must first determine that the two arguments
+are indeed integer objects.  In the corresponding dispatch logic, we
+have added two hints to promote the type of each of the two arguments.
+This produces a compiler that has the following behavior: in the general
+case, the expression \texttt{a+b} will generate two consecutive run-time
+switches followed by the residual code of the proper version of
+\texttt{add()}.  However, in \texttt{a+b+c}, the virtual structure representing
+the intermediate value will contain a compile-time constant as type.
+Promoting a compile-time constant is trivial – no run-time code is
+generated.  The whole expression \texttt{a+b+c} thus only requires three
+switches instead of four.  It is easy to see that even more switches can
+be skipped in larger examples; typically, in a tight loop manipulating
+only integers, all objects are virtual structures for the compiler and
+the residual code is theoretically optimal – all type propagation and
+boxing/unboxing occurs at compile-time.
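A rough Python model of a virtual structure follows (a purely illustrative sketch with invented names; PyPy's actual compiler works on low-level flow graphs, not Python objects):

```python
# Hypothetical sketch: a compile-time "virtual" box whose field holds a
# run-time location (a register name) instead of a value; no allocation
# appears in the residual code unless the box is forced.
class VirtualIntBox:
    def __init__(self, location):
        self.intval = location     # a run-time location, not a value

    def force(self):
        # Called only when the pointer escapes to a non-virtual location:
        # only then is the real heap allocation emitted.
        return ("new W_IntObject", self.intval)

def compile_add(loc1, loc2, emit):
    # Residual code: one machine addition, result kept in a register.
    result_loc = "r_result"
    emit(("int_add", loc1, loc2, result_loc))
    return VirtualIntBox(result_loc)   # the box itself stays virtual

ops = []
v1 = compile_add("r1", "r2", ops.append)        # a+b
v2 = compile_add(v1.intval, "r3", ops.append)   # (a+b)+c reads the location
```

Note that compiling `a+b+c` emitted two additions and no allocations: the intermediate box never left the compiler.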