Tue Jun 14 10:35:38 CEST 2011

Author: Carl Friedrich Bolz <cfbolz at gmx.de>
Changeset: r3673:aa70af0e63da
Date: 2011-06-14 10:38 +0200

Log:	merge

diff --git a/talk/iwtc11/benchmarks/numpy/array.c b/talk/iwtc11/benchmarks/numpy/array.c
new file mode 100644
--- /dev/null
+++ b/talk/iwtc11/benchmarks/numpy/array.c
@@ -0,0 +1,38 @@
+
+// an equivalent using targetmicronumpy is a+a+a+a+a with the same size
+
+#include <stdlib.h>
+#include <stdio.h>
+
+double *create_array(int size)
+{
+  int i;
+  double *a = (double*)malloc(size * sizeof(double));
+  for (i = 0; i < size; ++i) {
+    a[i] = (double)(i % 10);
+  }
+  return a;
+}
+
+#define MAX 5
+#define SIZE 10000000
+#define ITERATIONS 10
+
+int main()
+{
+  double *a[MAX];
+  double *res;
+  int i, k;
+
+  for (i = 0; i < MAX; ++i) {
+    a[i] = create_array(SIZE);
+  }
+  res = create_array(SIZE);
+  // actual loop
+  for (k = 0; k < ITERATIONS; ++k) {
+    for (i = 0; i < SIZE; ++i) {
+      res[i] = a[0][i] + a[1][i] + a[2][i] + a[3][i] + a[4][i];
+    }
+    printf("%f\n", res[125]); // to kill the optimizer
+  }
+}
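For reference, the computation this C benchmark performs can be mirrored in plain Python (a sketch only; `SIZE` here is deliberately much smaller than the benchmark's 10000000):

```python
SIZE = 1000  # the C benchmark uses 10000000; shrunk here for illustration

def create_array(size):
    # mirrors create_array() in array.c: element i holds i % 10
    return [float(i % 10) for i in range(size)]

a = [create_array(SIZE) for _ in range(5)]
# the "actual loop": element-wise sum of five arrays
res = [a[0][i] + a[1][i] + a[2][i] + a[3][i] + a[4][i] for i in range(SIZE)]
print(res[125])  # the same element the C benchmark prints
```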
diff --git a/talk/iwtc11/benchmarks/runall.sh b/talk/iwtc11/benchmarks/runall.sh
--- a/talk/iwtc11/benchmarks/runall.sh
+++ b/talk/iwtc11/benchmarks/runall.sh
@@ -6,5 +6,6 @@
./benchmark.sh gcc
./benchmark.sh gcc -O2
./benchmark.sh gcc -O3 -march=native
+./benchmark.sh gcc -O3 -march=native -fno-tree-vectorize
./benchmark.sh python2.7

diff --git a/talk/iwtc11/paper.tex b/talk/iwtc11/paper.tex
--- a/talk/iwtc11/paper.tex
+++ b/talk/iwtc11/paper.tex
@@ -109,11 +109,14 @@
%\subtitle{Subtitle Text, if any}

\authorinfo{Hakan Ardo XXX}
-           {Affiliation1}
+           {Centre for Mathematical Sciences, Lund University}
{hakan at debian.org}
\authorinfo{Carl Friedrich Bolz}
{Heinrich-Heine-Universität Düsseldorf}
{cfbolz at gmx.de}
+\authorinfo{Maciej Fijałkowski}
+           {Affiliation2}
+           {fijall at gmail.com}

\maketitle

@@ -208,11 +211,10 @@

Let us now consider a simple ``interpreter'' function \lstinline{f} that uses the
object model (see the bottom of Figure~\ref{fig:objmodel}).
-The loop in \lstinline{f} iterates \lstinline{y} times, and computes something in the process.
Simply running this function is slow, because there are lots of virtual method
calls inside the loop, one for each \lstinline{is_positive} and even two for each
call to \lstinline{add}. These method calls need to check the type of the involved
-objects repeatedly and redundantly. In addition, a lot of objects are created
+objects every iteration. In addition, a lot of objects are created
when executing that loop, many of these objects are short-lived.
The actual computation that is performed by \lstinline{f} is simply a sequence of
@@ -229,7 +231,7 @@
guard_class($p_{0}$, BoxedInteger)
$i_{3}$ = get($p_{0}$, intval)
-        $i_{4}$ = int_add($i_{2}$, $i_{3}$)
+        $i_{4}$ = $i_{2} + i_{3}$
$p_{5}$ = new(BoxedInteger)
# inside BoxedInteger.__init__
set($p_{5}$, intval, $i_{4}$)
@@ -263,8 +265,6 @@
\item \lstinline{set} writes to an attribute of an object.
\item \lstinline{guard_class} is a precise type check and precedes an
(inlined) method call and is followed by the trace of the called method.
-    comparison (greater than''), respectively.
\item \lstinline{guard_true} checks that a boolean is true.
\end{itemize}

@@ -279,23 +279,12 @@
first \lstinline{guard_class} instruction will fail and execution will continue
using the interpreter.

-The trace shows the inefficiencies of \lstinline{f} clearly, if one looks at
-the number of \lstinline{new}, \lstinline{set/get} and \lstinline{guard_class}
-operations. The number of \lstinline{guard_class} operation is particularly
-problematic, not only because of the time it takes to run them. All guards also
-interpreter, should the guard fail. This means that too many guard operations also
-consume a lot of memory.
-
-In the rest of the paper we will see how this trace can be optimized using
-partial evaluation.
-
\section{Optimizations}
Before the trace is passed to a backend compiling it into machine code
it needs to be optimized to achieve better performance.
The focus of this paper
is loop invariant code motion. The goal of that is to move as many
-operations as possible out of the loop making them executed only once
+operations as possible out of the loop, so that they are executed at most once
and not every iteration. This we propose to achieve by loop peeling. It
leaves the loop body intact, but prefixes it with one iteration of the
loop. This operation by itself will not achieve anything. But if it is
@@ -310,12 +299,16 @@

XXX find reference

-Loop peeling is achieved prefixing the loop with one iteration of itself. The
-peeled of iteration of the loop will end with a jump to the full loop, which
-ends with a jump to itself. This way the peeled of iteration will only be
-executed once while the second copy will be used for every further iteration.
+Loop peeling is achieved by appending a copy of the traced iteration at
+the end of the loop. The copy is inlined to make the two parts form a
+consistent two-iteration trace.
+The first part (called the preamble) ends with a jump to the second part
+(called the peeled loop), which in turn ends with a jump to itself. This way
+the preamble will be executed only once while the peeled loop will
+be used for every subsequent iteration.
The trace from Figure~\ref{fig:unopt-trace} would after this operation become
-the trace in Figure~\ref{fig:peeled-trace}.
+the trace in Figure~\ref{fig:peeled-trace}. Lines 1--13 show the
+preamble while lines 15--27 show the peeled loop.

\begin{figure}
\begin{lstlisting}[mathescape,numbers = right,basicstyle=\setstretch{1.05}\ttfamily\scriptsize]
@@ -327,7 +320,7 @@
guard_class($p_{0}$, BoxedInteger)
$i_{3}$ = get($p_{0}$, intval)
-        $i_{4}$ = int_add($i_{2}$, $i_{3}$)
+        $i_{4}$ = $i_{2}+i_{3}$
$p_{5}$ = new(BoxedInteger)
# inside BoxedInteger.__init__
set($p_{5}$, intval, $i_{4}$)
@@ -341,23 +334,23 @@
guard_class($p_{0}$, BoxedInteger)
$i_{7}$ = get($p_{0}$, intval)
-        $i_{8}$ = int_add($i_{6}$, $i_{7}$)
+        $i_{8}$ = $i_{6}+i_{7}$
$p_{9}$ = new(BoxedInteger)
# inside BoxedInteger.__init__
set($p_{9}$, intval, $i_{8}$)
jump($l_1$, $p_{0}$, $p_{9}$)
\end{lstlisting}
-\caption{An Unoptimized Trace of the Example Interpreter}
+\caption{A Peeled Trace of the Example Interpreter}
\label{fig:peeled-trace}
\end{figure}
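As a source-level analogue (not the trace-level transformation itself), peeling a simple counting loop looks like this:

```python
# Source-level analogue of loop peeling: the first iteration is split off
# in front (the "preamble"); the while loop handles every later iteration.
def sum_digits(n):
    total, i = 0, 0
    # preamble: executed at most once
    if i < n:
        total += i % 10
        i += 1
    # peeled loop: every subsequent iteration jumps back here
    while i < n:
        total += i % 10
        i += 1
    return total
```

By itself this achieves nothing; the point, as in the text, is that subsequent optimizations can now exploit facts established by the preamble.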

When applying the following optimizations to this two-iteration trace
-some care has to taken as to how the jump arguments of both
-iterations and the input arguments of the second iteration are
-treated. It has to be ensured that the second iteration stays a proper
-trace in the sense that the operations within it only operations on
-variables that are either among the input arguments of the second iterations
-or are produced within the second iterations. To ensure this we need
+some care has to be taken as to how the arguments of the two
+\lstinline{jump} operations and the input arguments of the peeled loop are
+treated. It has to be ensured that the peeled loop stays a proper
+trace in the sense that the operations within it only operate on
+variables that are either among its input arguments
+or produced within the peeled loop. To ensure this we need
to introduce a bit of formalism.

The original trace (prior to peeling) consists of three parts.
@@ -367,7 +360,7 @@
jump operation. The jump operation contains a vector of jump variables,
$J=\left(J_1, J_2, \cdots, J_{|J|}\right)$, that are passed as the input variables of the target loop. After
loop peeling there will be a second copy of this trace with input
-variables equal to the jump arguments of the peeled copy, $J$, and jump
+variables equal to the jump arguments of the preamble, $J$, and jump
arguments $K$. Looking back at our example we have
$$
I = \left(p_0, p_1\right), \;
J = \left(p_0, p_5\right), \;
K = \left(p_0, p_9\right)
.$$
@@ -380,8 +373,8 @@
To construct the second iteration from the first we also need a
-function $m$, mapping the variables of the first iteration onto the
-variables of the second. This function is constructed during the
+function $m$, mapping the variables of the preamble onto the
+variables of the peeled loop. This function is constructed during the
inlining. It is initialized by mapping the input arguments, $I$, to
the jump arguments $J$,
$$
m\left(I_i\right) = J_i
.$$
@@ -400,11 +393,11 @@

Each operation in the trace is inlined in order.
-To inline an operation $v=op\left(A_1, A_2, \cdots, A_{|A|}\right)$
+To inline an operation $v=\text{op}\left(A_1, A_2, \cdots, A_{|A|}\right)$
a new variable, $\hat v$ is introduced. The inlined operation will
-produce $\hat v$ from the input arguments
+produce $\hat v$ using
$$
-  \hat v = op\left(m\left(A_1\right), m\left(A_2\right),
+  \hat v = \text{op}\left(m\left(A_1\right), m\left(A_2\right),
\cdots, m\left(A_{|A|}\right)\right)
.$$
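A minimal sketch of this inlining step, using a hypothetical encoding of trace operations as (result, op, args) tuples (the naming scheme with a trailing apostrophe stands in for the fresh variables $\hat v$):

```python
# Sketch of the inlining that builds the peeled loop: m maps preamble
# variables to peeled-loop variables, initialized with m(I_i) = J_i.
def inline(trace, input_args, jump_args):
    m = dict(zip(input_args, jump_args))
    peeled = []
    for res, op, args in trace:
        new_res = res + "'"                          # fresh variable v-hat
        peeled.append((new_res, op, tuple(m.get(x, x) for x in args)))
        m[res] = new_res
    new_jump = [m.get(x, x) for x in jump_args]      # K: the inlined J
    return peeled, new_jump

# simplified fragment of the example trace, I = (p0, p1), J = (p0, p5)
trace = [("i2", "get", ("p1",)),
         ("i3", "get", ("p0",)),
         ("i4", "int_add", ("i2", "i3")),
         ("p5", "new", ())]
peeled, K = inline(trace, input_args=["p0", "p1"], jump_args=["p0", "p5"])
```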
Before the
@@ -426,12 +419,15 @@

\subsection{Redundant Guard Removal}

+XXX should we mention where in previous papers these optimizations were described?
+
No special care needs to be taken when implementing redundant
guard removal together with loop peeling. The guards from
-the first iteration might make the guards of the second iterations
+the preamble might make the guards of the peeled loop
redundant and thus removed. Therefore the net effect of combining redundant
guard removal with loop peeling is that loop-invariant guards are moved out of the
-loop. The second iteration of the example reduces to
+loop. The peeled loop of the example reduces to

\begin{lstlisting}[mathescape,numbers = right,basicstyle=\setstretch{1.05}\ttfamily\scriptsize]
$l_1$($p_{0}$, $p_{5}$):
@@ -440,7 +436,7 @@
$i_{6}$ = get($p_{5}$, intval)
$i_{7}$ = get($p_{0}$, intval)
-        $i_{8}$ = int_add($i_{6}$, $i_{7}$)
+        $i_{8}$ = $i_{6}+i_{7}$
$p_{9}$ = new(BoxedInteger)
# inside BoxedInteger.__init__
set($p_{9}$, intval, $i_{8}$)
@@ -453,13 +449,18 @@
guard on line 6.
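The combined effect can be sketched with a hypothetical op encoding (tuples of opcode and arguments): a guard is dropped whenever an identical guard was already seen earlier in the two-iteration trace.

```python
# Sketch of redundant guard removal over preamble + peeled loop: a guard
# identical to one already established earlier is simply dropped.
def remove_redundant_guards(ops):
    seen, out = set(), []
    for op in ops:
        if op[0] == "guard_class" and op in seen:
            continue  # already established by an earlier (preamble) guard
        seen.add(op)
        out.append(op)
    return out

trace = [("guard_class", "p0", "BoxedInteger"),   # from the preamble
         ("guard_class", "p0", "BoxedInteger")]   # peeled loop: redundant
optimized = remove_redundant_guards(trace)
```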

\subsection{Heap Caching}
+
+XXX gcc calls this store-sinking and I'm sure there are some
+references in the literature (none at hand though). This is a ``typical''
+compiler optimization.
+
The objective of heap caching is to remove \lstinline{get} and
\lstinline{set} operations whose results can be deduced from previous
\lstinline{get} and \lstinline{set} operations. Exact details of the
process are outside the scope of this paper. We only consider the interaction
with loop peeling.

-The issue at hand is to keep the second iteration a proper
+The issue at hand is to keep the peeled loop a proper
trace. Consider the \lstinline{get} operation on line 19 of
Figure~\ref{fig:unopt-trace}. The result of this operation can be
deduced to be $i_4$ from the \lstinline{set} operation on line
@@ -468,12 +469,12 @@
8. The optimization will thus remove line 19 and 22 from the trace and
replace $i_6$ with $i_4$ and $i_7$ with $i_3$.

-After that, the second
-iteration will no longer be in SSA form as it operates on $i_3$ and $i_4$
+After that, the peeled loop
+will no longer be in SSA form as it operates on $i_3$ and $i_4$
which are not part of it. The solution is to extend the input
arguments, $J$, with those two variables. This will also extend the
-jump arguments of the first iteration, which is also $J$.
-Implicitly that also extends the jump arguments of the second iteration, $K$,
+jump arguments of the preamble, which is also $J$.
+Implicitly that also extends the jump arguments of the peeled loop, $K$,
since they are the inlined versions of $J$. For the example $I$ has to
be replaced by $\hat I$ which is formed as a concatenation of $I$ and
$\left(i_3, i_4\right)$. At the same time $K$ has to be replaced by
@@ -484,15 +485,18 @@
replace $i_7=$get(...) with $i_7=i_3$ instead of removing it?

In general, what is needed is for the heap optimizer to keep track of
-which variables from the first iterations it reuses in the second
-iteration. It has to construct a vector of such variables $H$ which
-can be used to update the input and jump arguments,
+which variables from the preamble it reuses in the peeled loop.
+It has to construct a vector of such variables $H$ which
+can be used to update the input and jump arguments using
$$
\hat J = \left(J_1, J_2, \cdots, J_{|J|}, H_1, H_2, \cdots, H_{|H|}\right)
+\label{eq:heap-inputargs}
$$
+and
$$
\hat K = \left(K_1, K_2, \cdots, K_{|J|}, m(H_1), m(H_2), \cdots, m(H_{|H|})\right)
.
+\label{eq:heap-jumpargs}
$$
In the optimized trace $I$ is replaced by $\hat I$ and $K$ by $\hat K$. The trace from Figure~\ref{fig:unopt-trace} will be optimized to:
@@ -506,7 +510,7 @@
guard_class($p_{0}$, BoxedInteger)
$i_{3}$ = get($p_{0}$, intval)
-        $i_{4}$ = int_add($i_{2}$, $i_{3}$)
+        $i_{4}$ = $i_{2}+i_{3}$
$p_{5}$ = new(BoxedInteger)
# inside BoxedInteger.__init__
set($p_{5}$, intval, $i_{4}$)
@@ -516,42 +520,54 @@
# inside f: y = y.add(step)
-        $i_{8}$ = int_add($i_{4}$, $i_{3}$)
+        $i_{8}$ = $i_{4}+i_{3}$
$p_{9}$ = new(BoxedInteger)
# inside BoxedInteger.__init__
set($p_{9}$, intval, $i_{8}$)
jump($l_1$, $p_{0}$, $p_{9}$, $i_3$, $i_8$)
\end{lstlisting}
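The interaction of heap caching with the trace can be sketched with the same hypothetical op-tuple encoding as above: a \lstinline{set} records the attribute value, so a later \lstinline{get} on the same object and attribute is removed and its result variable forwarded.

```python
# Sketch of heap caching: set() records the attribute value, so a later
# get() on the same (object, attribute) pair is removed and its result
# is replaced by the cached value.
def heap_cache(ops):
    known, out, replace = {}, [], {}
    for op in ops:
        if op[0] == "set":                      # ("set", obj, attr, value)
            _, obj, attr, val = op
            known[(obj, attr)] = replace.get(val, val)
            out.append(op)
        elif op[0] == "get" and (op[1], op[2]) in known:
            # drop the get; remember its result equals the cached value
            replace[op[3]] = known[(op[1], op[2])]
        else:
            out.append(op)
    return out, replace

ops = [("set", "p5", "intval", "i4"),
       ("get", "p5", "intval", "i6")]   # deducible: i6 == i4
optimized, repl = heap_cache(ops)
```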

+\subsection{Pure Operation Reuse}
+If a pure operation appears more than once in the trace with the same input
+arguments, it only needs to be executed the first time; the result
+can then be reused for all other appearances. When this is combined with loop
+peeling, the single execution of the operation is placed in the
+preamble. That is, loop invariant pure operations are moved out of the
+loop. The interactions here are the same as in the previous
+section. That is, a vector, $H$, of variables produced in the preamble
+and used in the peeled loop needs to be constructed. Then the jump and
+input arguments are updated according to
+Equation~\ref{eq:heap-inputargs} and Equation~\ref{eq:heap-jumpargs}.
+
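A sketch of pure operation reuse (a form of common subexpression elimination), again with the hypothetical (result, op, args) tuple encoding:

```python
# Sketch of pure-operation reuse: a pure op with the same opcode and
# arguments as an earlier one is dropped and its result reused.
def reuse_pure_ops(ops):
    seen, out, replace = {}, [], {}
    for res, op, args in ops:
        args = tuple(replace.get(a, a) for a in args)
        key = (op, args)
        if key in seen:
            replace[res] = seen[key]     # reuse the earlier result
        else:
            seen[key] = res
            out.append((res, op, args))
    return out, replace

ops = [("i4", "int_add", ("i2", "i3")),
       ("i8", "int_add", ("i2", "i3"))]  # same pure op, same args
out, repl = reuse_pure_ops(ops)
```

When the first occurrence sits in the preamble and the second in the peeled loop, the reused variable becomes exactly one of the entries of the vector $H$ described above.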
\subsection{Allocation Removals}
By using escape analysis it is possible to identify objects that are
-allocated within the loop but never escape it. That is the object are
-short lived and no references to them exists outside the loop. This
-is performed by processing the operation from top to bottom and
+allocated within the loop but never escape it. That is,
+short-lived objects with no references to them outside the loop. This
+is performed by processing the operations in order and
optimistically removing every \lstinline{new} operation. Later on if
it is discovered that a reference to the object escapes the loop, the
\lstinline{new} operation is inserted at this point. All operations
(\lstinline{get} and \lstinline{set}) on the removed objects are also
removed and the optimizer needs to keep track of the value of all
-attributes of the object.
+used attributes of the object.

Consider again the original unoptimized trace of
-Figure~\label{fig:peeled-trace}. Line 10 contains the first
+Figure~\ref{fig:peeled-trace}. Line 10 contains the first
allocation. It is removed and $p_5$ is marked as virtual. This means
-that it refers to an virtual object that was not yet
+that it refers to a virtual object that has not yet been
(and might never be) allocated. Line 12 sets the \lstinline{intval}
attribute of $p_5$. This operation is also removed and the optimizer
registers that the attribute \lstinline{intval} of $p_5$ is $i_4$.

When the optimizer reaches line 13 it needs to construct the
-arguments for the \lstinline{jump} operation, which contains the virtual
+arguments of the \lstinline{jump} operation, which contains the virtual
reference $p_5$. This can be achieved by exploding $p_5$ into its
attributes. In this case there is only one attribute and its value is
$i_4$, which means that $p_5$ is replaced with $i_4$ in the jump
arguments.
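Exploding a virtual into its attributes can be sketched as follows (the dictionary-based representation of virtuals is hypothetical; sorting the attribute names stands in for the "same order" requirement discussed below):

```python
# Sketch of exploding a virtual in the jump arguments: each virtual is
# replaced by the values of its registered attributes, recursively, so
# that the resulting vector contains only non-virtual variables.
def explode(arg, virtuals):
    if arg in virtuals:
        out = []
        for _attr, value in sorted(virtuals[arg].items()):
            out.extend(explode(value, virtuals))
        return out
    return [arg]

virtuals = {"p5": {"intval": "i4"}}      # p5 is virtual; its intval is i4
jump_args = ["p0", "p5"]
exploded = [v for a in jump_args for v in explode(a, virtuals)]
```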

In the general case, each virtual in the jump arguments is exploded into a
-vector of variables containing the values of all it's attributes. If some
+vector of variables containing the values of all registered attributes. If some
of the attributes are themselves virtuals they are recursively exploded
to make the vector contain only non-virtual variables. Some care has
to be taken to always place the attributes in the same order when
@@ -580,8 +596,8 @@
\right)
.

-and the arguments of the \lstinline{jump} operation of the second
-operation, $K$, are replaced by inlining $\hat J$,
+and the arguments of the \lstinline{jump} operation of the peeled loop,
+$K$, constructed by inlining $\hat J$,

\hat K = \left(m\left(\hat J_1\right), m\left(\hat J_2\right),
\cdots, m\left(\hat J_{|\hat J|}\right)\right)
@@ -599,7 +615,7 @@
guard_class($p_{0}$, BoxedInteger)
$i_{3}$ = get($p_{0}$, intval)
-        $i_{4}$ = int_add($i_{2}$, $i_{3}$)
+        $i_{4}$ = $i_{2}+i_{3}$
# inside BoxedInteger.__init__
jump($l_1$, $p_{0}$, $i_{4}$)

@@ -609,26 +625,42 @@
guard_class($p_{0}$, BoxedInteger)
$i_{7}$ = get($p_{0}$, intval)
-        $i_{8}$ = int_add($i_{4}$, $i_{7}$)
+        $i_{8}$ = $i_{4}+i_{7}$
# inside BoxedInteger.__init__
jump($l_1$, $p_{0}$, $i_8$)
\end{lstlisting}

Note that virtuals are only exploded into their attributes when
-constructing the arguments of the jump of the first iteration. This
+constructing the arguments of the jump of the preamble. This
explosion can't be repeated when constructing the arguments of the
-jump of the second iteration as it has to mach the first. This means
+jump of the peeled loop as it has to match the first. This means
the objects that were passed as pointers (non-virtuals) from the first
-iteration to the second also has to be passed as pointers from the
-second iteration to the third. If one of these objects are virtual
-at the end of the second iteration they need to be allocated right
+iteration to the second (from preamble to peeled loop) also have to be
+passed as pointers from the second iteration to the third (from peeled
+loop to peeled loop). If one of these objects is virtual
+at the end of the peeled loop it needs to be allocated right
before the jump. With the simple objects considered in this paper,
that is not a problem. However in more complicated interpreters such
an allocation might, in combination with other optimizations, lead
to additional variables from the first iteration being imported into
the second. This extends both $\hat J$ and $\hat K$, which means that
some care has to be taken, when implementing this, to allow $\hat J$ to
-grow while inlining it into $\hat K$.
+grow while inlining it into $\hat K$. XXX: Maybe we can skip this?
+
+\section{Limitations}
+
+XXX as of now?
+
+Loop invariant code motion as described has certain limitations
+that prevent it from speeding up larger loops. Those limitations are a
+target of future work and might be lifted. The most important ones are:
+
+\begin{itemize}
+\item Bridges are not well supported: if the control flow is more complex than a
+      single loop, a bridge might need to jump to the beginning of the preamble,
+      making the optimization ineffective.
+\item XXX write about flushing caches at calls?
+\end{itemize}

\section{Benchmarks}

@@ -658,7 +690,7 @@
fixed-point arithmetic with 16 bits of precision. In Python there is only
a single implementation of the benchmark that gets specialized
depending on the class of its input argument, $y$, while in C,
-  there is three different implementations.
+  there are three different implementations.
\item {\bf conv3}: one-dimensional convolution with a kernel of fixed
size $3$.
\item {\bf conv5}: one-dimensional convolution with a kernel of fixed
@@ -677,9 +709,9 @@
on top of a custom image class that is specially designed for the
problem. It ensures that there will be no failing guards, and makes
a lot of the two dimension index calculations loop invariant. The
-  intention there is twofold. It shows that the performance-impact of
+  intention here is twofold. It shows that the performance-impact of
having wrapper classes giving objects some application-specific
-  properties is negligible. This is due to the inlining performed
+  properties can be negligible. This is due to the inlining performed
during the tracing and the allocation removal of the index objects
introduced. It also shows that it is possible to do some low-level
hand optimizations of the Python code and hide those optimization
@@ -689,7 +721,23 @@
XXX we need Psyco numbers

\subsection{Numpy}
-XXX: Fijal?
+
+As a part of the PyPy project, we implemented a small numerical kernel for
+performing matrix operations. The exact extent of this kernel is beyond
+the scope of this paper; the basic idea is to unroll a series of
+array operations into a loop compiled to assembler. LICM is a very good
+optimization for this kind of operation. The example benchmark performs
+addition of five arrays, compiling it in a way that is equivalent to the C loop:
+
+\begin{figure}
+\begin{lstlisting}[mathescape,basicstyle=\setstretch{1.05}\ttfamily\scriptsize]
+for (int i = 0; i < SIZE; i++) {
+   res[i] = a[i] + b[i] + c[i] + d[i] + e[i];
+}
+\end{lstlisting}
+\end{figure}
+
+where \lstinline{res}, \lstinline{a}, \lstinline{b}, \lstinline{c}, \lstinline{d} and \lstinline{e} are \lstinline{double} arrays.
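The "unroll a series of array operations into a loop" idea can be loosely modeled with a lazy expression tree that is forced by a single pass over the elements (all names here are hypothetical; the actual micronumpy implementation is not shown in this commit):

```python
# Loose model of the numpy kernel: array additions build a lazy expression
# tree; force() evaluates the whole tree in one element-wise loop.
class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def force(self):
        l, r = force(self.left), force(self.right)
        return [x + y for x, y in zip(l, r)]

def force(expr):
    # plain lists are already concrete; Add nodes are evaluated
    return expr.force() if isinstance(expr, Add) else expr

a = b = c = d = e = [float(i % 10) for i in range(100)]
res = force(Add(Add(Add(Add(a, b), c), d), e))  # res = a+b+c+d+e
```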

\subsection{Prolog}
XXX: Carl?