# [pypy-svn] r27931 - pypy/extradoc/talk/dls2006

mwh at codespeak.net mwh at codespeak.net
Tue May 30 18:41:30 CEST 2006

Author: mwh
Date: Tue May 30 18:41:27 2006
New Revision: 27931

Log:
add an initial latex version

on the upside: it compiles
on the downside: all the figures are in verbatim environments (this looks
entertainingly awful), most things that should be \cite{foo}s are just
left in as [foo]s

==============================================================================
--- (empty file)
+++ pypy/extradoc/talk/dls2006/paper.tex	Tue May 30 18:41:27 2006
@@ -0,0 +1,1136 @@
+\documentclass{acm_proc_article-sp}
+
+\begin{document}
+
+\title{Still Missing a Cool Title}
+
+% title ideas: Implementing Virtual Machines in Dynamic Languages?
+
+\maketitle
+
+\begin{abstract}
+The PyPy project seeks to prove both on a research and a
+practical level the feasibility of writing a virtual machine (VM)
+for a dynamic language in a dynamic language -- in this case, Python.
+The aim is to translate (i.e. compile) the VM to arbitrary target
+environments, ranging in level from C/Posix to Smalltalk/Squeak via
+Java and CLI/.NET, while still being of reasonable efficiency within
+these environments.
+
+A key tool to achieve this goal is the systematic reuse of the
+(unmodified, dynamically typed) Python language as a system
+programming language at various levels of our architecture and
+translation process.  For each level, we design a corresponding type
+system and apply a generic type inference engine -- for example, the
+garbage collector is written in a style that manipulates
+simulated pointer and address objects, and when translated to C
+these operations become C-level pointer and address instructions.
+\end{abstract}
+
+\section{Introduction}
+
+Despite the constant trend in the programming world towards
+portability and reusability, there are some areas in which it is still
+notoriously difficult to write flexible, portable, and reasonably
+efficient programs.  The implementation of virtual machines is one
+such area.  Building implementations of general programming languages,
+in particular highly dynamic ones, using a classic direct coding
+approach, is typically a long-winded effort and produces a result that
+is tailored to a specific platform and in which architectural decisions
+(e.g. about GC) are spread across the code in a pervasive and invasive
+way.
+
+For this and other reasons, standard platforms emerge; nowadays, a
+language implementer could cover most general platforms in use by
+writing three versions of his virtual machine: for C/Posix, for Java,
+and for CLI/.NET.  This is, at least, the current situation of the
+Python programming language, where independent volunteers have
+developed and are now maintaining Java and .NET versions of Python,
+which follow the evolution of the "official" C version (CPython).
+
+However, we believe that platform standardization does not have to be
+a necessary component of this equation.  We are basically using the
+standard "meta-programming" argument: if one could write the VM in a
+very high level language, then the VM itself could be automatically
+\textit{translated} to any lower-level platform.  Moreover, by writing
+the VM in such a language we would gain flexibility in
+architectural choices and expressiveness.
+
+PyPy achieves this goal without giving up on the efficiency of the
+compiled VMs.
+
+The key factors enabling this result are not to be found in recent
+advances in any particular research area -- we are not for example using
+constraint-based type inference.  Instead, we are following a novel
+overall architecture: it is split into many levels of stepwise
+translation from the high-level source of the VM to the final target
+platform.  Similar platforms can reuse many of these steps, while for
+very different platforms we have the option to perform very different
+translation steps.  Each step reuses a common type inference component
+with a different, ad-hoc type system.
+
+Experiments also suggest a more mundane reason why such an approach is
+only practical today: a typical translation takes about half an hour
+on a modern PC and consumes between 512MB and 1GB of RAM.
+
+We briefly describe the architecture of PyPy in section
+\ref{architecture}.  In section \ref{systemprog} we describe our
+approach of varying the type systems at various levels of the
+translation.  Section \ref{typeinference} gives an overview of the
+type inference engine we developed (and can be read independently of
+section \ref{systemprog}).  We present experimental results in section
+\ref{experimentalresults} and future work directions in section
+\ref{futurework}.  In section \ref{relatedwork} we compare with
+related work, and finally we conclude in section \ref{conclusion}.
+
+\section{Architecture}
+\label{architecture}
+
+There are two major components in PyPy:
+
+\begin{enumerate}
+\item the \textit{Standard Interpreter}: an implementation of the Python programming
+language, mostly complete and compliant with the current version of the
+language, Python 2.4.
+\item the \textit{Translation Process}: a translation tool-suite whose goal is to
+compile subsets of Python to various environments.
+\end{enumerate}
+
+In particular, we have defined a subset of the Python language called
+"restricted Python" or RPython.  This sublanguage is not restricted
+syntactically, but only in the way it manipulates objects of different
+types.  The restrictions are a compromise between expressivity and
+the need to statically infer enough types to generate efficient code.
+The foremost purpose of the translation tool-suite is to compile such
+RPython programs to a variety of different platforms.
+
+Our current efforts, and the present paper, focus on this tool-suite.
+We will not describe the Standard Interpreter component of PyPy in the
+sequel, other than to mention that it is written in RPython and can thus be
+translated.  At close to 90,000 lines of code, it is the largest RPython
+program that we have translated so far.  More information can be found
+in [S].
+
+
+\section{System programming with Python}
+\label{systemprog}
+
+\hypertarget{the-translation-process}{}
+\subsection{The translation process}
+\label{translationprocess}
+
+The translation process starts from RPython source code and eventually
+produces low-level code suitable for the target environment.  It can be
+described as performing a series of step-wise transformations.  Each
+step is based on control flow graph transformations and rewriting, and
+on the ability to augment the program with further implementation code
+written in Python and analysed with the suitable type system.
+
+The front-end part of the translation process analyses the input
+RPython program in two phases, as follows\footnote{Note that the two
+phases are intermingled in time, because type inference proceeds from
+an entry point function and follows all calls, and thus only gradually
+discovers (the reachable parts of) the input program.}:
+
+\begin{verbatim}
+  [figure 0: flow graph and annotator, e.g. part of doc/image/translation.*
+             then a stack of transformations
+  ]
+\end{verbatim}
+
+\begin{enumerate}
+\item We take as input RPython functions\footnote{The input to our
+       translation chain consists of loaded runtime function objects,
+       not source code or ASTs.  This allows us to use unrestricted
+       Python for meta-programming purposes at load time, in a
+       seemingly staged programming approach, in which the whole of
+       the source program -- as a Python program -- produces the RPython
+       program input to the tool-chain as the object graph loaded in
+       memory.  This includes both the relevant functions and prebuilt
+       data.}, and convert them to control flow graphs -- a structure
+       amenable to analysis.  These flow graphs contain polymorphic
+       operations only: in Python, almost all operations are
+       dynamically overloaded by type.
+
+\item We perform type inference on the control flow graphs.  At this
+      stage, types inferred are part of the type system which is the
+      very definition of the RPython sub-language: they are roughly a
+      subset of Python's built-in types, with some more precision to
+      describe e.g. the items stored in container types.
+      Occasionally, a single input function can produce several
+      specialized versions, i.e. several similar but differently typed
+      graphs.  This type inference process is described in more
+      detail in section \ref{typeinference}.
+\end{enumerate}
+
+At the end of the front-end analysis, the input RPython program is
+represented as a forest of flow graphs with typed variables.  Following
+this analysis are a number of transformation steps.  Each transformation
+step modifies the graphs in-place, by altering their structure and/or
+the operations they contain.  Each step inputs graphs typed in one type
+system and leaves them typed in a possibly different type system, as we
+will describe in the sequel.  Finally, a back-end turns the resulting
+graphs into code suitable for the target environment, e.g. C source code
+ready to be compiled.
+
+
+\subsection{Transformations}
+
+When the translation target is C or C-like environments, the first of
+the transformation steps takes the RPython-typed flow graphs, still
+containing polymorphic operations only, and produces flow graphs with
+monomorphic C-like operations and C-like types.  In the simplest case,
+this is the only transformation step: these graphs are directly fed to
+the C back-end, which turns them into ANSI C source code.
+
+But RPython comes with automatic memory management, and this first
+transformation step produces flow graphs that also assume automatic
+memory management.  Generating C code directly from there produces a
+fully leaking program, unless we link it with an external garbage
+collector (GC) like the Boehm conservative GC [Boehm], which is a
+viable option.
+
+We have two alternatives, each implemented as a transformation step.
+The first one inserts naive reference counting throughout the whole
+program's graphs, which without further optimizations gives exceedingly
+bad performance (it should be noted that the CPython interpreter is also
+based on reference counting, and experience suggests that it was not a
+bad choice in this particular case).
+
+The other, and better, alternative is an exact GC, coupled with a
+transformation, the \textit{GC transformer}.  It inputs C-level-typed graphs
+and replaces all \texttt{malloc} operations with calls to a garbage
+collector's innards.  It can inspect all the graphs to discover the
+\texttt{struct} types in use by the program, and assign a unique type id to
+each of them.  These type ids are collected in internal tables that
+describe the layout of the structures, e.g. their sizes and the location
+of the pointer fields.
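+
+As an illustration only, the kind of table described above could be
+sketched as follows.  The struct descriptions, the one-word-per-field
+layout and the name \texttt{build\_type\_table} are assumptions made
+for this example; they are not the data structures of the actual GC
+transformer.
+
+\begin{verbatim}
```python
WORD = 8  # assumption: every field occupies one 8-byte word


def build_type_table(struct_types):
    """Assign a type id to each struct type and describe its layout.

    struct_types maps a struct name to a list of (field_name, kind)
    pairs, where kind is "ptr" for GC pointers and anything else for
    non-pointer fields.  Both the input format and the output format
    are hypothetical.
    """
    type_ids = {}
    table = []
    for name, fields in struct_types.items():
        type_ids[name] = len(table)  # ids are simply table indices
        pointer_offsets = [
            position * WORD
            for position, (_, kind) in enumerate(fields)
            if kind == "ptr"
        ]
        table.append({
            "size": len(fields) * WORD,
            "pointer_offsets": pointer_offsets,
        })
    return type_ids, table


structs = {
    "node": [("value", "int"), ("next", "ptr")],
    "pair": [("first", "ptr"), ("second", "ptr")],
}
type_ids, table = build_type_table(structs)
```
+\end{verbatim}
+
+A collector written against such a table only needs the type id of an
+object to find its size and the pointers it must trace.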
+
+We have implemented other transformations as well, e.g. performing
+various optimizations, or turning the whole code into a
+continuation-passing style (CPS) that allows us to use coroutines
+without giving up the ability to generate fully ANSI C code.  (This will
+be the subject of another paper.)
+% [I'm not sure our transformation can be classified as classical CPS,
+% although there are known similar techniques but the terminology is
+% quite confused]
+% [mention exception transformer too]
+
+Finally, currently under development is a variant of the very first
+transformation step, for use when targeting higher-level,
+object-oriented (OO) environments.  It is currently being designed
+together with back-ends for Smalltalk/Squeak\footnote{Our simple OO
+type system is designed for \textit{statically-typed} OO environments,
+including Java; the presence of Smalltalk as a back-end might be
+misleading in that respect.} and CLI/.NET.  This first transformation
+step, for C-like environments, is called the \textit{LLTyper}: it produces
+C-level flow graphs, where the object-oriented features of RPython
+(classes and instances) become manipulations of C structs with
+explicit virtual table pointers.  By contrast, for OO environments the
+transformation step is called the \textit{OOTyper}: it targets a simple
+object-oriented type system, and preserves the classes and instances
+of the original RPython program.  The LLTyper and OOTyper still have
+much code in common, to convert the more Python-specific features like
+its complex calling conventions.
+
+More information about these transformations can be found in [T].
+
+
+\subsection{System code}
+
+A common pattern in all the transformation steps is to somehow lower the
+level at which the graphs are currently expressed.  Because of this,
+there are operations that were atomic in the input (higher-level) graphs
+but that need to be decomposed into several operations in the target
+(lower-level) graphs.  In some cases, the equivalent functionality
+requires more than a couple of operations: a single operation must be
+replaced by a call to whole new code -- functions and classes that serve
+as helpers.  An example of this is the \texttt{malloc} operation for the GC
+transformer.  Another example is the \texttt{list.append()} method, which is
+atomic for Python or RPython programs, but needs to be replaced in
+C-level code by a helper that possibly reallocates the array of items.
+
+This means that in addition to transforming the existing graphs, each
+transformation step also needs to insert new functions into the forest.
+A key feature of our approach is that we can write such "system-level"
+code -- relevant only to a particular transformation -- in plain Python
+as well:
+
+\begin{verbatim}
+Figure 1 - a helper to implement list.append()
+
+    def ll_append(lst, newitem):
+        # Append an item to the end of the vector.
+        index = lst.length         # get the 'length' field
+        ll_resize(lst, index+1)    # call a helper not shown here
+        itemsarray = lst.items     # get the 'items' field
+        itemsarray[index] = newitem  # this behaves like a C array
+\end{verbatim}
+
+The idea is to feed these new Python functions into the front-end, using
+this time the transformation's target (lower-level) type system during
+the type inference.  In other words, we can write plain Python code that
+manipulates objects that conform to the lower-level type system, and
+have these functions automatically transformed into appropriately typed
+graphs.
+
+For example, \texttt{ll\textunderscore{}append()} in figure 1 is a Python function
+that manipulates objects that behave like C structures and arrays.
+This function is inserted by the LLTyper, as a helper to implement the
+\texttt{list.append()} calls found in its RPython-level input graphs.
+By going through the front-end reconfigured to use C-level types, the
+above function becomes a graph with such C-level types\footnote{The
+low-level type system specifies that the function should be
+specialized by the C-level type of its input arguments, so it actually
+turns into one graph per list type -- list of integers, list of
+pointers, etc.  This behavior gives the programmer a feeling
+comparable to C++ templates, without the declarations.}, which is then
+indistinguishable from the other graphs of the forest produced by the
+LLTyper.
+
+In the example of the \texttt{malloc} operation, replaced by a call to GC
+code, this GC code can invoke a complete collection of dead objects, and
+can thus be arbitrarily complicated.  Still, our GC code is entirely
+written in plain Python, and it manipulates "objects" that are still at
+a lower level: pointer and address objects.  Even with the restriction
+of having to use pointer-like and address-like objects, Python remains
+more expressive than, say, C to write a GC.  [see also Jikes]
+
+In the sequel, we will call \textit{system code} functions written in
+Python that are meant to be analysed by the front-end.  For the
+purpose of this article we will restrict this definition to helpers
+introduced by transformations, as opposed to the original RPython
+program, although the difference is not fundamental to the translation
+process (and although our input RPython program, as seen in section
+\ref{architecture}, is often itself a Python virtual machine!).
+
+Note that such system code cannot typically be expressed as normal
+RPython functions, because it corresponds to primitive operations at
+that level.  As an aside, let us remark that the number of primitive
+operations at RPython level is, comparatively speaking, quite large:
+all list and dictionary operations, instance and class attribute
+accesses, many string processing methods, a good subset of all Python
+built-in functions...  Compared to other approaches [e.g. Squeak], we
+do not try to minimize the number of primitives -- at least not at the
+source level.  It is fine to have many primitives at any high enough
+level, because they can all be implemented at the next lower level in
+a way that makes sense to that level.  The key reason why this is not
+burdensome is that the lower level implementations are also written in
+Python -- with the only difference that they use (and have to be
+typeable in) the lower-level type system\footnote{This is not strictly
+true: the type systems are even allowed to co-exist in the same
+function.  The operations involving higher-level type systems are
+turned into lower-level operations by the previous transformations in
+the chain, which leave the already-low-level operations untouched.}.
+
+
+\subsection{Type systems}
+
+The four levels that we considered so far are summarized in figure 2.
+
+\begin{verbatim}
+    [figure 2:    RPython
+                  /     \
+                 /       \
+       LLTypeSystem     OOTypeSystem
+               /
+              /
+    ]
+\end{verbatim}
+
+The RPython level is a subset of Python, so the types mostly follow
+Python types, and the instances of these types are instances in the
+normal Python sense; e.g. whereas Python has only a single type
+\texttt{list}, RPython has a parametric type \texttt{list(T)} for every RPython
+type \texttt{T}, but instances of \texttt{list(T)} are just those Python lists
+whose items are all instances of \texttt{T}.
+
+The other type systems, however, do not correspond to built-in Python
+types.  For each of them, we implemented:
+
+\begin{enumerate}
+\item the types, which we use to tag the variables of the graphs at
+      the given level.  (Types are actually annotated, self-recursive
+      formal terms, and would have been implemented simply as such if
+      Python supported them directly.)
+
+\item the Python objects that emulate instances of these types.  (More
+      about them below.)
+\end{enumerate}
+
+We have defined well-typed operations between variables of these types,
+plugging on the standard Python operators.  These operations are the
+ones that the emulating instances implement.  As seen above, the types
+can also be used by type inference when analysing system code like the
+helpers of figure 1.
+
+Now, clearly, the purpose of types like a "C-like struct" or a "C-like
+array" is to be translated to a real \texttt{struct} or array declaration by
+the C back-end.  What, then, is the purpose of emulating such things in
+Python?  The answer is three-fold.  Firstly, if we have objects that
+live within the Python interpreter, but faithfully emulate the behavior
+of their C equivalent while performing additional safety checks, they
+are an invaluable help for testing and debugging.  For example, we can
+check the correctness of our hash table implementation, written in
+Python in terms of struct- and array-like objects, just by running it.
+The same holds for the GC.
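+
+A minimal sketch of this idea is given below.  The classes
+\texttt{EmulatedArray} and \texttt{EmulatedList} and the body of
+\texttt{ll\_resize} are hypothetical stand-ins written for this
+example, not the emulating objects of PyPy; only
+\texttt{ll\_append} follows figure 1.  The point is that the helper
+runs unchanged on ordinary Python objects that check every access.
+
+\begin{verbatim}
```python
class EmulatedArray:
    """Hypothetical stand-in for a C-like fixed-size array."""

    def __init__(self, length):
        self._items = [0] * length

    def __len__(self):
        return len(self._items)

    def _check(self, index):
        # A C array would silently read or write out of bounds; the
        # emulation turns the mistake into an immediate error.  Negative
        # indices are rejected too, unlike with Python lists.
        if not isinstance(index, int) or not 0 <= index < len(self._items):
            raise IndexError("out-of-bounds access at index %r" % (index,))

    def __getitem__(self, index):
        self._check(index)
        return self._items[index]

    def __setitem__(self, index, value):
        self._check(index)
        self._items[index] = value


class EmulatedList:
    """Hypothetical stand-in for the C-like struct behind a list."""

    def __init__(self):
        self.length = 0
        self.items = EmulatedArray(0)


def ll_resize(lst, newlength):
    # Assumed growth policy: reallocate with some slack when too small.
    if newlength > len(lst.items):
        newitems = EmulatedArray(newlength + (newlength >> 3) + 4)
        for i in range(lst.length):
            newitems[i] = lst.items[i]
        lst.items = newitems
    lst.length = newlength


def ll_append(lst, newitem):
    index = lst.length           # get the 'length' field
    ll_resize(lst, index + 1)    # may reallocate the items array
    lst.items[index] = newitem   # checked like a C array would not be


lst = EmulatedList()
for n in range(10):
    ll_append(lst, n * n)
```
+\end{verbatim}
+
+Running the helper directly, as above, exercises the reallocation
+logic, and an off-by-one index raises an error instead of corrupting
+memory.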
+
+Secondly, and anecdotally, as the type inference process (section
+\ref{typeinference}) is based on abstract interpretation, we can use
+the following trick: the resulting type of most low-level operations
+is deduced simply by example.  Sample C-level objects are
+instantiated, used as arguments to a given operation, and produce a
+sample result, whose C-level type must be the type of the result
+variable in the graph.
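+
+The trick can be sketched in a few lines.  The type names
+(\texttt{Signed}, \texttt{Float}, \texttt{Bool}), the sample values
+and the function \texttt{result\_type\_by\_example} are assumptions of
+this example, with plain Python numbers playing the role of the
+emulated C-level objects.
+
+\begin{verbatim}
```python
import operator

# Hypothetical low-level types, each with one sample value.
SAMPLES = {"Signed": 7, "Float": 2.5, "Bool": True}

# Reverse mapping from the Python class of a sample to its type name.
TYPE_OF = {int: "Signed", float: "Float", bool: "Bool"}


def result_type_by_example(operation, *arg_types):
    """Deduce the result type of an operation by running it on samples."""
    sample_args = [SAMPLES[name] for name in arg_types]
    sample_result = operation(*sample_args)
    # type() is exact, so a bool result is not confused with an int.
    return TYPE_OF[type(sample_result)]


int_add = result_type_by_example(operator.add, "Signed", "Signed")
mixed_add = result_type_by_example(operator.add, "Signed", "Float")
int_lt = result_type_by_example(operator.lt, "Signed", "Signed")
```
+\end{verbatim}
+
+No per-operation typing rule is written down here: the behaviour of
+the sample objects is the rule.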
+
+The third reason is fundamental: we use these emulating objects to
+\textit{represent} pre-built objects at that level.  For example, the GC
+transformer instantiates the objects emulating C arrays for the internal
+type id tables, and it fills them with the correct values.  These array
+objects are then either used directly when testing the GC, or translated
+by the C back-end into static pre-initialized arrays.
+
+
+
+\section{Type inference}
+\label{typeinference}
+
+
+The various analyses used -- from type inference to lifetime analysis --
+are generally formulated as [abstract interpretation].  While this
+approach is known to be less efficient than more tailored algorithms
+like constraint-based type inference, we gain in freedom,
+controllability and simplicity.  This proved essential in our overall
+approach: as described in section \ref{systemprog}, we need to perform
+type inference with many different type systems, the details of which
+have evolved along the road.
+
+We mitigate the potential efficiency problem by wise choices and
+compromises for the domain used; the foremost example of this is that
+our RPython type inference performs almost no automatic specialization
+of functions.
+% XXX class annotation vs concrete runtime class sets XXX
+We achieved enough precision for our purpose, though.
+
+In the sequel, we give a more precise description of this process and
+justify our claim that good performance and enough precision can be
+achieved -- at least in some contexts -- without giving up the naive but
+flexible approach.
+
+
+\subsection{Building control flow graphs}
+\label{flowobjspace}
+\hypertarget{flowobjspace}{}
+
+As described in the overview of \href{\#the-translation-process}{the
+translation process}, the front-end of the translation tool-chain
+works in two phases: it first builds control flow graphs from Python
+functions, and then performs whole-program type inference on these
+graphs.
+
+Remember that building the control flow graphs is not done, as one might
+first expect, by following a function at the syntactic level.  Instead,
+the whole program is imported in a normal Python interpreter; the full
+Python language is used at this point as a kind of preprocessor with
+meta-programming capabilities.  Once the program is imported, the object
+data in memory consists of Python function objects in bytecode format,
+and any other kind of objects created at import-time, like class
+objects, prebuilt instances of those, prebuilt tables, and so on.  Note
+that these objects typically no longer have a textual representation; for
+example, cyclic data structures may have been built at this point.  The
+translation tool-chain first turns these function objects into in-memory
+control flow graphs which contain direct references to the prebuilt data
+objects, and then handles and transforms these graphs.
+
+We found in-process debugging sufficient and did not implement dumping
+of any intermediate step to disk.  Figure 3 shows the control flow graph
+obtained for a simple function -- this is a screenshot from our graph
+viewer, used for debugging; basic block placement is performed by
+[Graphviz].
+
+\begin{verbatim}
+    [figure 3: insert a nice pygame screenshot]
+\end{verbatim}
+
+The actual transformation from function objects -- i.e. bytecode -- to
+flow graph is performed by the Flow Object Space, a short but generic
+plug-in component for the Python interpreter of PyPy.  The architecture
+of our Python interpreter is shown in figure 4.
+
+\begin{verbatim}
+Figure 4 - the interpreter and object spaces
+
+    +------------------------------------------------------+
+    |   forest of bytecode objects from the application    |
+    +------------------------------------------------------+
+    |               Python bytecode interpreter            |
+    +--------------------------------+---------------------+
+    |      Standard Object Space     |  Flow Object Space  |
+    +--------------------------------+---------------------+
+\end{verbatim}
+
+Note that the left column, i.e. the bytecode interpreter and the
+Standard Object Space, forms the full Python interpreter of PyPy.  It is
+an RPython program, and the whole purpose of the translation process is
+to accept this as \textit{input}, and translate it to an efficient form.  Its
+architecture is not relevant to the way it is translated.
+% XXX?
+
+However, the bytecode interpreter plays a double role, at two different
+levels.  The so-called Object Spaces are \textit{domains} in the abstract
+interpretation terminology.  By design, we cleanly separated these
+domains from the bytecode interpreter core; the latter is only
+responsible for decoding the bytecodes of an application and emulating
+the corresponding stack machine.  It treats all actual application-level
+objects as black boxes, and dispatches all operations on them to the
+Object Space.  The Standard Object Space is a concrete domain, in which
+objects are the concrete Python objects of the various built-in types:
+lists, dictionaries, and so on.  By contrast, the Flow Object Space is
+really an abstract domain.  It handles objects that are placeholders.
+Its lattice order is shown in figure 5.
+
+\begin{verbatim}
+    [figure 5:                Variable
+
+                         /      |             \           \
+                        /       |              \           \
+                       /        |               \           \
+      Constant(1) ... Constant(n) ... Constant([1,2,3])  ... Constant(<instance of class A>) ...
+    ]
+\end{verbatim}
+
+This order is extremely simple, because most actual analysis is delayed
+to the next phase, the type inference engine.  The objects are either
+\textit{Variables}, which are pure placeholders for entirely unknown values,
+or \textit{Constants} with a concrete Python object as value.  The order places
+Variable as the top, and keeps all \textit{Constants} unordered.  Thus if two
+different constants merge during abstract interpretation, we immediately
+widen them to Variable.
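+
+The merge rule can be written down directly.  The sketch below is not
+the code of the Flow Object Space; in particular, deciding that two
+Constants are ``the same'' when their values have the same type and
+compare equal is an assumption of this example.
+
+\begin{verbatim}
```python
class Variable:
    """Pure placeholder for an entirely unknown value (top of the order)."""


class Constant:
    """Wraps a concrete Python object."""

    def __init__(self, value):
        self.value = value


def same_constant(a, b):
    # Assumption of this sketch: same type and equal values.
    return type(a.value) is type(b.value) and a.value == b.value


def union(a, b):
    """Merge two placeholder objects where control paths join."""
    if (isinstance(a, Constant) and isinstance(b, Constant)
            and same_constant(a, b)):
        return a
    # Different constants, or at least one Variable: widen to Variable.
    return Variable()


kept = union(Constant(1), Constant(1))
widened = union(Constant(1), Constant(2))
```
+\end{verbatim}
+
+Since the order has only two levels, a value can be widened at most
+once, which keeps the abstract interpretation of a function short.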
+
+In conjunction with the Flow Object Space, the bytecode interpreter of
+PyPy thus performs abstract interpretation of Python bytecodes from
+the application\footnote{Note that this process uses the
+\textit{unmodified} bytecode interpreter.  This means that it is
+independent of most language details.  Changes in syntax or in
+bytecode format or opcode semantics only need to be implemented once,
+in the bytecode interpreter.  In effect, the Flow Object Space enables
+an interpreter for \textit{any} language to work as a front-end for
+the rest of the tool-chain.}.  In this case, the bytecodes in
+question come from the RPython application that we would like to
+translate.
+
+The Flow Object Space records all operations that the bytecode
+interpreter "would like" to do between the placeholder objects.  It
+records them into basic block objects that will eventually be part of
+the control flow graph of the whole function.  The recorded operations
+take Variables and Constants as argument, and produce new Variables as
+results.  The Constants serve two purposes: they are a way to
+introduce constant values into the flow graphs -- these values may be
+arbitrarily complex objects, not just primitives -- and they allow
+basic constant propagation\footnote{This is useful at this level for
+some constructs of the bytecode interpreter, which can temporarily
+wrap internal values and push them onto the regular value stack among
+the other application-level objects.  We need to be able to unwrap
+them again later.}.
+% XXX compile-time computation in the helpers.
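+
+As a rough sketch of the recording described above, consider a toy
+object space with a single operation.  The class
+\texttt{RecordingSpace}, its method names and the tuple format of the
+recorded operations are hypothetical and much simpler than the real
+Flow Object Space.
+
+\begin{verbatim}
```python
class Variable:
    """Placeholder for an unknown run-time value."""


class Constant:
    """Wraps a concrete Python object known at graph-building time."""

    def __init__(self, value):
        self.value = value


class RecordingSpace:
    """Toy object space that records operations instead of doing them."""

    def __init__(self):
        self.operations = []  # contents of the current basic block

    def wrap(self, value):
        return Constant(value)

    def add(self, w_left, w_right):
        if isinstance(w_left, Constant) and isinstance(w_right, Constant):
            # Basic constant propagation: both operands are known, so
            # compute the result now and record nothing.
            return Constant(w_left.value + w_right.value)
        w_result = Variable()
        self.operations.append(("add", w_left, w_right, w_result))
        return w_result


space = RecordingSpace()
w_x = Variable()                                   # an unknown argument
w_y = space.add(w_x, space.wrap(1))                # recorded
w_five = space.add(space.wrap(2), space.wrap(3))   # folded
```
+\end{verbatim}
+
+The interpreter calling \texttt{add} cannot tell the difference with a
+concrete object space: it receives an object back and carries on.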
+
+In the flow graph, branching occurs when the bytecode interpreter tries
+to inspect the truth value of placeholder objects, as it would in
+response to conditional jump opcodes or other more complicated opcodes:
+at this point, the Flow Object Space starts two new basic blocks and --
+with a technique akin to continuations -- tricks the interpreter into
+following both branches, one after the other.  Additionally, the
+bytecode interpreter sends simple positional signals that allow the Flow
+Object Space to detect when control paths merge, or when loops close.
+In this way, abstract interpretation quickly terminates and the recorded
+operations form a graph, which is the control flow graph of the original
+bytecode.
+
+Note that we produce flow graphs in Static Single Information (SSI)
+form, an extension of Static Single Assignment ([SSA]): each variable is
+used in exactly one basic block.  All variables that are not dead
+at the end of a basic block are explicitly carried over to the next
+block and renamed.
+
+While the Flow Object Space is quite a short piece of code -- its core
+functionality fits in 300 lines -- the detail of the interactions
+sketched above is not entirely straightforward; we refer the reader to
+
+
+\subsection{The Annotator}
+
+The type inference engine, which we call the \textit{annotator}, is the central
+component of the front-end part of the translation process.  Given a
+program considered as a family of control flow graphs, the annotator
+assigns to each variable of each graph a so-called \textit{annotation}, which
+describes the possible run-time objects that this variable can contain.
+Following usual terminology, we will call such annotations \textit{types} -- not
+to be confused with the Python notion of the concrete type of an object.
+An annotation is a set of possible values, and such a set is not always
+the set of all objects of a specific Python type.
+
+Here is a simplified, static model of how the annotator works.  It can
+be considered as taking as input a finite family of functions calling
+each other, and working on the control flow graphs of each of these
+functions as built by the \href{flowobjspace}{Flow Object Space}.
+Additionally, for a particular "entry point" function, the annotator
+is provided with user-specified types for the function's arguments.
+
+The goal of the annotator is to find the most precise type that can be
+given to each variable of all control flow graphs while respecting the
+constraints imposed by the operations in which these variables are
+involved.
+
+More precisely, it is usually possible to deduce information about the
+result variable of an operation given information about its arguments.
+For example, we can say that the addition of two integers must be an
+integer.  Most programming languages have this property.  However,
+Python -- like many languages not specifically designed with type
+inference in mind -- does not possess a type system that allows much
+useful information to be derived about variables based on how they are
+\textit{used}; only on how they were \textit{produced}.  For example, a number of very
+different built-in types can be involved in an addition; the meaning of
+the addition and the type of the result depend on the type of the input
+arguments.  Merely knowing that a variable will be used in an addition
+does not give much information per se.  For this reason, our annotator
+works by flowing types forward, operation after operation, i.e. by
+performing abstract interpretation of the flow graphs.  In a sense, it
+is a more naive approach than the one taken by type systems specifically
+designed to enable more advanced inference algorithms.  For example,
+Hindley-Milner type inference works in an inside-out direction, by
+starting from individual operations and propagating type constraints
+outwards [H-M].
+
+Naturally, simply propagating types forward requires the use of a fixed
+point algorithm in the presence of loops in the flow graphs or in the
+inter-procedural call graph.  Indeed, we flow types forward from the
+beginning of the entry point function into each basic block, operation
+after operation, and follow all calls recursively.  During this process,
+each variable along the way gets a type.  In various cases, e.g. when we
+close a loop, the previously assigned types can be found to be too
+restrictive.  In this case, we generalise them to allow for a larger set
+of possible run-time values, and schedule the block where they appear
+for reflowing.  The more general type can generalise the types of the
+result variables of the operations in the block, which in turn can
+generalise the types that flow into the following blocks, and so on.
+This process continues until a fixed point is reached.
+
+We can consider that all variables are initially assigned the ``bottom''
+type corresponding to the empty set of possible run-time values.  Types
+can only ever be generalised, and the model is simple enough to show
+that there is no infinite chain of generalisation, so that this process
+necessarily terminates.
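+
+As a rough illustration of this argument (a sketch only: it assumes
+that the integer-like RPython types form a single chain, which is what
+the lattice figures are meant to show), the longest sequence of
+generalisations that one variable can go through along that chain is
+
+$$
+Bot < Bool < NonNegInt < Int < Top
+$$
+
+i.e. four steps.  Since a block is only scheduled for reflowing when
+the type of one of its input variables has been strictly generalised,
+a block with $k$ input variables of this kind would be reflown at most
+$4k$ times under this assumption.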
+
+
+\subsection{RPython types}
+
+As seen in section \ref{systemprog}, we use the annotator with more than one type
+system.  The most interesting and complex one is the RPython type
+system, which describes how the input RPython program can be annotated.
+The other type systems contain lower-level, C-like types that are mostly
+unordered, thus forming more trivial lattices than the one formed by
+RPython types.
+
+The set $A$ of RPython types is defined as the following formal terms:
+
+\begin{itemize}
+\item $Bot$, $Top$ -- the minimum and maximum elements (corresponding
+      to ``impossible value'' and ``most general value'');
+
+\item $Int$, $NonNegInt$, $Bool$ -- integers, known-non-negative
+      integers, booleans;
+
+\item $Str$, $Char$ -- strings, characters (which are strings of
+      length 1);
+
+\item $Inst(class)$ -- instance of $class$ or a subclass thereof
+      (there is one such term per $class$);
+
+\item $List(v)$ -- list; $v$ is a variable summarising the items of
+      the list (there is one such term per variable);
+
+\item $Pbc(set)$ -- where the $set$ is a subset of the (finite) set of
+      all prebuilt constant objects.  This set includes all the
+      callables of the input program: functions, classes, and methods.
+
+\item $None$ -- stands for the singleton \texttt{None} object of
+      Python.
+
+\item $NullableStr$, $NullableInst(class)$ -- a string or
+      \texttt{None}; resp. an instance or \texttt{None}.
+\end{itemize}
+
+Figures 6 and 7 show how these types are ordered to form a lattice.  We
+mostly use only its [join-semilattice] structure.
+
+\begin{verbatim}
+.. graphviz:: image/lattice1.dot
+
+:Figure 6: the lattice of annotations.
+\end{verbatim}
+
+\begin{verbatim}
+.. graphviz:: image/lattice2.dot
+
+:Figure 7: The part about instances and nullable instances, assuming a
+           simple class hierarchy with only two direct subclasses of
+           object.
+\end{verbatim}
+
+
+All list terms for all variables are unordered.  The Pbcs form a
+classical finite set-of-subsets lattice.  In addition, we have left
+out a number of other annotations that are irrelevant for the basic
+description of the annotator and straightforward to handle:
+$Dictionary$, $Tuple$, $Float$, $UnicodePoint$, $Iterator$, etc.  The
+complete list is described in [T].
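+
+To make the ordering on the Pbcs explicit (this is our reading of
+``set-of-subsets lattice''; the notation $\vee$ for the join is
+introduced here only for illustration): for two sets $s_1$ and $s_2$
+of prebuilt constants,
+
+$$
+Pbc(s_1) \leq Pbc(s_2) \iff s_1 \subseteq s_2,
+\qquad
+Pbc(s_1) \vee Pbc(s_2) = Pbc(s_1 \cup s_2)
+$$
+
+so that, for example, merging $Pbc(\{f\})$ and $Pbc(\{g\})$ at the
+entry of a block gives $Pbc(\{f, g\})$.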
+
+The type system moreover comes with a family of rules, which for every
+operation and every sensible combination of input types describes the
+type of its result variable.  Let $V$ be the set of Variables that
+appear in the user program's flow graphs.  Let $b$ be a map from $V$
+to $A$; it is a ``binding'' that gives each variable a type.  The
+purpose of the annotator is to compute such a binding stepwise.
+
+Let $x$, $y$ and $z$ be Variables.  We introduce the rule:
+
+% XXX format this!
+$$
+\begin{array}{c}
+    z = \mathrm{add}(x, y), \; b(x) = Int, \; Bool \leq b(y) \leq Int \\
+    \hline
+    b' = b \hbox{\ with\ } (z \rightarrow Int)
+\end{array}
+$$
+
+which specifies that if we see the addition operation applied to a
+Variable whose current binding is $Int$ and a Variable whose current
+binding lies between $Bool$ and $Int$, a new binding $b'$ can be
+produced: it is $b$ except on $z$, where we have $b'(z) = Int$.
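+
+For instance (with a binding chosen purely for illustration), suppose
+that $b(x) = Int$, $b(y) = Bool$ and $b(z) = Bot$.  The premises of the
+rule hold, since $Bool \leq b(y) \leq Int$, and applying it gives
+
+$$
+b'(x) = Int, \quad b'(y) = Bool, \quad b'(z) = Int
+$$
+
+so that only the type of the result variable $z$ has been generalised,
+from $Bot$ to $Int$.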
+
+The type inference engine can be seen as applying rules of this kind
+repeatedly.  It does not apply them in random order, but follows a
+forward-propagation order typical of abstract interpretation.
+
+It is outside the scope of the present paper to describe the type
+inference engine and the rules more formally.  The difficult points
+involve mutable containers -- e.g. initially empty lists that are filled
+somewhere else -- and the discovery of instance attributes -- in Python,
+classes do not declare upfront a fixed set of attributes for their
+instances, let alone their types.  Both these examples require
+sophisticated reflowing techniques that invalidate existing types in
+already-annotated basic blocks, to account for the influence of more
+general types coming indirectly from a possibly distant part of the
+program.  The reader is referred to [D] for more information.
+
+
+\subsection{Termination and complexity}
+
+The lattice model clearly contains no infinite chain.  Moreover, it is
+designed so that the number of reflowings required in
+practice is small.  For example, we are not trying to do range analysis
+beyond detecting non-negatives -- the reason is that range analysis is
+not an essential part of writing \textit{reasonably} efficient code.  Consider
+that a translated RPython program runs hundreds of times faster than
+when the same program is executed by the standard Python interpreter: in
+this context, range analysis appears less critical.  It is a possible
+optimization that we can introduce in a later, optional analysis and
+transformation step.
+
+The worst case behaviors that can appear in the model described above
+involve the lattice of Pbcs, with variables that could contain
+e.g. one function object among many.  An example of such behavior is
+code manipulating a table of function objects: when an item is read
+out of the table, its type is a large Pbc set:
+$Pbc(\{f1, f2, f3, \ldots\})$.  But in this example, the whole set is
+available at once, and not built incrementally by successive
+discoveries.  This seems to be often the case in practice: it is not
+very common for programs to manipulate objects that belong to a large
+but finite family -- and when they do, the whole family tends to be
+available early on, requiring few reflowings.
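+
+To illustrate the difference (a schematic example, with $n$
+hypothetical functions $f_1, \ldots, f_n$): if the functions were
+discovered one at a time, the type of the variable could climb the
+chain
+
+$$
+Pbc(\{f_1\}) < Pbc(\{f_1, f_2\}) < \cdots < Pbc(\{f_1, \ldots, f_n\})
+$$
+
+in $n - 1$ separate generalisations, each one potentially triggering a
+reflow of the blocks that use the variable.  When the whole table is
+available at once, the variable is given the type
+$Pbc(\{f_1, \ldots, f_n\})$ directly, in a single step.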
+
+This means that \textit{in practice} the complete process requires a time that
+is far lower than the worst case.  We have experimentally confirmed
+this: annotating the whole PyPy interpreter (90,000 lines) takes on the
+order of 5 to 10 minutes, and basic blocks are typically only reflown a
+handful of times, providing a close-to-linear practical complexity.
+
+We give formal termination and correctness proofs in [D], as well as
+worst-case bounds and some experimental evidence of their practical
+irrelevance.
+
+
+\subsection{Precision}
+
+Of course, this would be pointless if the annotation did not give
+precise enough information for our needs.  We must describe a detail of
+the abstract interpretation engine that is critical for precision: the
+propagation of conditional types.  Consider the following source code
+fragment:
+
+\begin{verbatim}
+if isinstance(x, MyClass):
+    f(x)
+else:
+    g(x)
+\end{verbatim}
+
+Although the type of \texttt{x} may be some parent class of
+\texttt{MyClass}, it can be deduced to be of the more precise type
+$Inst(MyClass)$ within the positive branch of the \texttt{if}.
+(Remember that our graphs are in SSI form, which means that the
+\texttt{x} inside each basic block is a different Variable with a
+possibly different type as annotation.) XXX flow sensitivity and SSA
+
+This is implemented by introducing an extended family of types for
+boolean values:
+
+$$
+Bool(v_1: (t_1, f_1), v_2: (t_2, f_2), \ldots)
+$$
+
+where the $v_n$ are variables and $t_n$ and $f_n$ are types.  The
+result of a check, like \texttt{isinstance()} above, is typically
+annotated with such an extended $Bool$.  The meaning of the type is as
+follows: if the run-time value of the boolean is True, then we know
+that each variable $v_n$ has a type at most as general as $t_n$; and
+if the boolean is False, then each variable $v_n$ has a type at most
+as general as $f_n$.  This information is propagated from the check
+operation to the exit of the block via such an extended $Bool$ type,
+and the conditional exit logic in the type inference engine uses it to
+trim the types it propagates into the next blocks (this is where the
+\textit{meet} of the lattice is used).
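+
+As an illustration on the fragment above (assuming, hypothetically,
+that \texttt{x} enters the block with the type $Inst(Base)$, where
+\texttt{Base} is some parent class of \texttt{MyClass}), the result of
+the \texttt{isinstance()} check would be annotated along the lines of
+
+$$
+Bool(x: (Inst(MyClass), Inst(Base)))
+$$
+
+The Variable standing for \texttt{x} in the positive branch is then
+bound to the meet of $Inst(Base)$ and $Inst(MyClass)$, which is
+$Inst(MyClass)$, while in the negative branch it keeps the type
+$Inst(Base)$.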
+
+With the help of the above technique, we achieve a reasonable precision
+in small examples.  For larger examples, a different, non-local
+technique is required: the specialization of functions. XXX use 'polymorphism'
+
+As described in the introduction, the most important downside of our
+approach is that automatic specialization is a potential
+performance-killer.  We \textit{do} support specialization, however: we can
+generate several independently-annotated copies of the flow graphs of
+certain functions.  When annotating RPython programs, such
+specialization does not happen automatically: we rely on hints provided
+by the programmer in the source code, in the form of flags attached to
+function objects.  As we had this trade-off in mind when we wrote the
+Python interpreter of PyPy, we only had to add a dozen or so hints in
+the end.
+
+This does not mean that automatic specialization policies are difficult
+to implement.  Indeed, the simpler lower-level type systems rely quite
+heavily on them: this is because the system code helpers are often
+generic and can receive arguments of various C-level types.  In this
+case, because the types at this level are limited and mostly unordered,
+specializing all functions on their input argument's types works well.
+
+At the level of RPython, on the other hand, the range of specializations
+that make sense is much wider.  We have used anything between
+specialization by the type of an argument, to specialization by an
+expected-to-be-constant argument value, to memoized functions that the
+type inference engine will actually call during annotation and replace
+by look-up tables, to complete overriding of the annotator's behavior in
+extreme cases.  In this sense, the need for manual specialization turned
+into an advantage, in terms of simplicity and flexibility of implementing
+and using new specialization schemes.
+
+This conclusion can be generalized.  We experimented with a simple
+approach to type inference that works well in practice, and that can
+very flexibly accommodate changes in the type system and even completely
+different type systems.  We think that the reasons for this success are
+to be found on the one hand in the (reasonable) restrictions we put on
+ourselves when designing the RPython language and writing the Python
+interpreter of PyPy in RPython, and on the other hand in an ad-hoc type
+system that is designed to produce enough precision (but not more) for
+the purpose of the subsequent transformations to C-level code.
+
+We should mention that restricting oneself to writing RPython code instead
+of Python is still a burden, and that we are not proposing in any way
+that the Python language itself should evolve in this direction, nor
+even that RPython usage should become widespread.  It is a tool designed
+with a specific goal in mind, which is the ability to produce reasonably
+efficient, stand-alone code in a large variety of environments.
+
+
+
+\section{Experimental results}
+\label{experimentalresults}
+
+
+\subsection{Performance}
+
+Our tool-chain is capable of translating the Python interpreter of
+PyPy, written in RPython, producing right now either ANSI C code as
+described before, or LLVM\footnote{The LLVM project is the realisation
+of a portable assembler infrastructure, offering both a virtual
+machine with JIT capabilities and static compilation. Currently we are
+using the latter with its good high-level optimisations for PyPy.}
+assembler which is then natively compiled with LLVM tools.
+
+The tool-chain has been tested with and can successfully apply
+transformations enabling various combinations of features. The
+translated interpreters are benchmarked using pystone (a [Dhrystone]
+derivative traditionally used by the Python community, although it is a
+rather poor benchmark) and the classical [Richards] benchmark and
+compared against [CPython] 2.4.3 results:
+
+\begin{verbatim}
++------------------------------------+-------------------+-------------------+
+|  Interpreter                       | Richards,         | Pystone,          |
+|                                    | Time/iteration    | Iterations/second |
++====================================+===================+===================+
+|  CPython 2.4.3                     |   789ms    (1.0x) |   40322    (1.0x) |
++------------------------------------+-------------------+-------------------+
+|  pypy-c                            |  4269ms    (5.4x) |    7587    (5.3x) |
++------------------------------------+-------------------+-------------------+
+|  pypy-c-thread                     |  4552ms    (5.8x) |    7122    (5.7x) |
++------------------------------------+-------------------+-------------------+
+|  pypy-c-stackless                  |  XXX       (6.0x) |    XXX     (6.2x) |
++------------------------------------+-------------------+-------------------+
+|  pypy-c-gcframework                |  6327ms    (8.0x) |    4960    (8.1x) |
++------------------------------------+-------------------+-------------------+
+|  pypy-c-stackless-gcframework      |  XXX       (    ) |    XXX     (    ) |
++------------------------------------+-------------------+-------------------+
+|  pypy-llvm-c                       |  3797ms    (4.8x) |    7763    (5.2x) |
++------------------------------------+-------------------+-------------------+
+|  pypy-llvm-c-prof                  |  2772ms    (3.5x) |   10245    (3.9x) |
++------------------------------------+-------------------+-------------------+
+\end{verbatim}
+
+The numbers in parentheses are slow-down factors compared to CPython.
+These measurements reflect PyPy revision 27815, compiled with GCC 3.4.4.
+LLVM is version 1.8cvs (May 11, 2006).  The machine runs GNU/Linux SMP
+on an Intel(R) Pentium(R) 4 CPU at 3.20GHz with 2GB of RAM and 1MB of
+cache.  The rows correspond to variants of the translation process, as
+follows:
+
+\begin{description}
+\item[pypy-c]
+    The simplest variant: translated to C code with no explicit memory
+    management, and linked with the Boehm conservative GC [Boehm].
+
+\item[pypy-c-thread]
+    The same, with OS thread support enabled.  (For measurement purposes,
+    thread support is kept separate because it has an impact on the GC
+    performance.)
+
+\item[pypy-c-stackless]
+    The same as pypy-c, plus the ``stackless transformation'' step which
+    modifies the flow graph of all functions in a way that allows them
+    to save and restore their local state, as a way to enable coroutines.
+
+\item[pypy-c-gcframework]
+    In this variant, the ``gc transformation'' step inserts explicit
+    memory management and a simple mark-and-sweep GC implementation.
+    The resulting program is not linked with Boehm.  Note that it is not
+    possible to find all roots from the C stack in portable C; instead,
+    in this variant each function explicitly pushes and pops all roots
+    to an alternate stack around each subcall.
+
+\item[pypy-c-stackless-gcframework]
+    This variant combines the ``gc transformation'' step with the
+    ``stackless transformation'' step.  The overhead introduced by the
+    stackless feature is balanced by the removal of the overhead of
+    pushing and popping roots explicitly on an alternate stack: indeed,
+    in this variant it is possible to ask the functions in the current C
+    call chain to save their local state and return.  This has the
+    side-effect of moving all roots to the heap, where the GC can find
+    them.
+
+\item[pypy-llvm-c]
+    The same as pypy-c, but using the LLVM back-end instead of the C
+    back-end.  The LLVM assembler-compiler gives the best results when --
+    as we do here -- it optimizes its input and generates C code again,
+    which is fed to GCC.
+
+\item[pypy-llvm-c-prof]
+    The same as pypy-llvm-c, but using GCC's profile-driven
+    optimizations.
+\end{description}
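+
+For reference, the slow-down factors in the table can be recomputed
+from the raw figures.  Richards reports a time, so the factor is the
+ratio of the PyPy time to the CPython time; Pystone reports a rate, so
+the ratio is inverted.  For the pypy-c row, for example:
+
+$$
+\frac{4269\,\mathrm{ms}}{789\,\mathrm{ms}} \approx 5.4,
+\qquad
+\frac{40322}{7587} \approx 5.3
+$$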
+
+The speed difference with CPython 2.4.3 can be explained at two levels.
+One is that CPython is hand-crafted C code that has been continuously
+optimized for a decade now, whereas the Python interpreter of PyPy first
+seeks flexibility and high abstraction levels.  The other, probably
+dominant, factor is that, as various indicators show, our approach places a
+very high load on the GC and on the memory caches of the machine.  The
+Boehm GC is known to be less efficient than more customized approaches;
+kernel-level profiling shows that pypy-c typically spends 30\% of its
+time in the Boehm library.  Our current, naively simple mark-and-sweep
+GC is considerably worse.  The interaction with processor caches is also
+hard to predict and account for; in general, we tend to produce
+relatively large amounts of code and prebuilt data.
+
+
+\subsection{Translation times}
+
+A complete translation of the pypy-c variant takes about 39 minutes,
+divided as follows:
+
+\begin{verbatim}
++-------------------------------------------+------------------------------+
+| Step                                      |   Time (minutes:seconds)     |
++===========================================+==============================+
+| Front-end                                 |            9:01              |
+| (flow graphs and type inference)          |                              |
++-------------------------------------------+------------------------------+
+| LLTyper                                   |           10:38              |
+| (from RPython-level to C-level graphs     |                              |
+| and data)                                 |                              |
++-------------------------------------------+------------------------------+
+| Various low-level optimizations           |            6:51              |
+| (convert some heap allocations to local   |                              |
+| variables, inlining, ...)                 |                              |
++-------------------------------------------+------------------------------+
+| Database building                         |            8:39              |
+| (this initial back-end step follows all   |                              |
+| graphs and prebuilt data structures       |                              |
+| recursively, assigns names, and orders    |                              |
+| them suitably for code generation)        |                              |
++-------------------------------------------+------------------------------+
+| Generating C source                       |            2:25              |
++-------------------------------------------+------------------------------+
+| Compiling (gcc -O2)                       |            3:23              |
++-------------------------------------------+------------------------------+
+\end{verbatim}
+
+An interesting feature of this table is that type inference is not the
+bottleneck.  Indeed, further transformation steps typically take longer
+than type inference alone.  This is the case for the LLTyper step,
+although it has a linear complexity in the size of its input (most
+transformations do).
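+
+As a quick check on the figures in the table (simple arithmetic on the
+rows as printed), the six steps add up to
+
+$$
+9{:}01 + 10{:}38 + 6{:}51 + 8{:}39 + 2{:}25 + 3{:}23 = 40{:}57
+$$
+
+which is in the same range as the overall figure quoted above.
+Measured in seconds, the front-end accounts for
+$541 / 2457 \approx 22\%$ of this total, against
+$638 / 2457 \approx 26\%$ for the LLTyper.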
+
+Other transformations like the ``gc'' and the ``stackless'' ones actually
+take more time, particularly when used in combination with each other (we
+speculate it is because of the increase in size caused by the previous
+transformations).  A translation of pypy-c-stackless, without counting
+GCC time, takes 60 minutes; the same for pypy-c-stackless-gcframework
+takes XXX minutes.
+
+
+
+\section{Future work}
+\label{futurework}
+
+
+As described in section \ref{experimentalresults}, the performance of
+the compiled Python interpreters is still not up to competing with the
+well-established CPython.  We are always working to improve matters,
+considering new optimizations and better GCs.  Also, the OOTyper and
+back-ends for Smalltalk/Squeak and CLI/.NET are currently in progress.
+
+
+\subsection{JIT Specializer}
+
+So far, the PyPy tool-chain can only translate the Python interpreter of
+PyPy into a program which is again an interpreter -- the same interpreter
+translated to C, essentially, although we have already shown that some
+aspects can be ``woven'' in at translation time, like support for
+coroutines.
+
+To achieve high performance for dynamic languages such as Python, the
+proven approach is to use dynamic compilation techniques, i.e. to write
+JITs.  With direct techniques, however, this is a major endeavour, and it
+increases the effort needed to further evolve the language.
+
+In the context of the PyPy project, we are now exploring -- as we planned
+from the start -- the possibility of producing a JIT as a graph
+transformation aspect from the Python interpreter.  This idea is based
+on the theoretical possibility of turning interpreters into compilers by
+partial evaluation [PE].  In our approach, this is done by analysing
+the forest of flow graphs built from the Python interpreter, which is a
+well-suited input for techniques of this kind.  We can currently perform
+binding-time analysis [BTA] on these graphs, again with abstract
+interpretation techniques reusing the type inference engine.  The next
+step is to transform the graphs -- following the binding-time annotations
+-- into a compiler; more precisely, in partial evaluation terminology, a
+generating extension.  We can currently do this on trivial examples.
+
+The resulting generating extension will be essentially similar to
+[Psyco], which is the only (and hand-written) JIT available for Python so
+far, based on run-time specialization.
+
+
+
+\section{Related work}
+\label{relatedwork}
+
+Applying the expressiveness or at least syntax of very high-level and
+dynamically typed languages to their implementation has been
+investigated many times.
+
+One typical approach is writing a static compiler.  The viability of
+and effort required for such an approach usually depend on the binding
+and dispatch semantics of the language.  Common Lisp native compilers,
+usable interactively and taking functions or files as compilation
+units, are a well-known example of that approach.  Late binding for all
+names and load semantics make such an approach very hard for Python,
+if speed improvements are desired.
+
+It is more relevant to consider and compare with projects using
+dynamic and very high-level languages for interpreters and VM
+implementations, and Just-In-Time compilers.
+
+[Scheme48] was a Scheme implementation using a restricted Scheme, with
+static type inference based on Hindley-Milner; this is viable with
+Scheme as the base language. Portability and simplicity were its major
+goals.
+
+[Squeak] is a Smalltalk implementation in Smalltalk. It uses SLang, a
+very restricted subset of Smalltalk with few types and strict
+conventions such that mostly direct translation to C is possible.
+Both the VM and the object memory and garbage collector support are
+explicitly written together in this style. Again, simplicity and
+portability were the major goals, not sophisticated manipulation,
+analysis and weaving in of features.
+
+[Jikes RVM] is a Java VM and Just-In-Time compiler written in Java.
+Bootstrapping happens by self-applying the compiler on a host VM, and
+dumping a snapshot from memory of the resulting native code.
+
+This approach directly enables high performance, at the price of
+portability, as usual with approaches that emit pure native
+code. Modularity of features, when possible, is achieved with
+normal software modularity. The indirection costs are taken care of by
+the inlining done by the compiler, sometimes even through explicit
+ways to request it. In particular this modular approach is used
+for implementing a range of choices for GC support XXX ref.  This was
+the inspiration for PyPy's own GC framework, although much more tuning
+and work went into Jikes RVM. PyPy's own GC framework also exploits
+inlining of helpers and barriers to recover performance.
+
+Jikes RVM's native JIT compilers probably cannot easily be retargeted to
+run on top of and target another VM (for example a CLR runtime) instead
+of hardware processors. Also, Jikes RVM pays the complexity of writing
+a JIT up-front, which also means that features and semantics of the
+language are encoded in the JIT compiler code, so that major changes
+to the language would likely require major surgery on it.
+
+PyPy's more indirect approach, together hopefully with our future work
+on generating a JIT compiler, tries to overcome these limitations, at
+the price of some more effort required to achieve very good
+performance. It is too soon to compare fully the complexity (and
+performance) trade-offs of these approaches.
+
+
+XXX Jython, IronPython, UVM.
+
+\section{Conclusion}
+\label{conclusion}
+
+
+XXX
+
+nice interpreter not polluted by implementation details
+here Python, but any interpreter works
+
+architecture allows implementing features at the right level
+
+dynamic language enables defining our own various type systems
+[ref pluggable type systems]
+
+practical VMs will result with a bit more efforts
+
+\end{document}