[pypy-commit] extradoc extradoc: merge

cfbolz noreply at buildbot.pypy.org
Thu Aug 16 11:43:32 CEST 2012


Author: Carl Friedrich Bolz <cfbolz at gmx.de>
Branch: extradoc
Changeset: r4604:7680cda8c312
Date: 2012-08-16 11:42 +0200
http://bitbucket.org/pypy/extradoc/changeset/7680cda8c312/

Log:	merge

diff --git a/talk/dls2012/licm.pdf b/talk/dls2012/licm.pdf
index d18f16a934b2496b5090209281663580f227b6e6..53e9a461f7d0e384c8c7fba88a6002c1337aaeb1
GIT binary patch

[cut]

diff --git a/talk/dls2012/paper.tex b/talk/dls2012/paper.tex
--- a/talk/dls2012/paper.tex
+++ b/talk/dls2012/paper.tex
@@ -920,8 +920,9 @@
 we see improvements in several cases. The ideal loop for this optimization
 is short and contains numerical calculations with no failing guards and no
 external calls. Larger loops involving many operations on complex objects
-typically benefit less from it. Loop peeling never makes runtime performance worse, in
-the worst case the peeled loop is exactly the same as the preamble. Therefore we
+typically benefit less from it. Loop peeling never makes the generated code worse; in
+the worst case the peeled loop is exactly the same as the preamble.
+Therefore we
 chose to present benchmarks of small numeric kernels where loop peeling can show
 its use.
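As a purely illustrative sketch (PyPy's JIT applies this at the trace level, not to source code), loop peeling duplicates the first iteration as a preamble so that loop-invariant work is computed once and the remaining iterations can rely on it:

```python
# Schematic illustration of loop peeling (a sketch, not PyPy's actual
# implementation): the first iteration is "peeled" off into a preamble,
# so work that is invariant across iterations runs only once and the
# remaining loop body can be optimized under stronger assumptions.

def interpret_naive(ops, n):
    total = 0
    for i in range(n):
        bound = len(ops)        # invariant, but recomputed every iteration
        total += ops[i % bound]
    return total

def interpret_peeled(ops, n):
    total = 0
    if n > 0:
        # preamble: one peeled iteration; invariants are computed here
        bound = len(ops)
        total += ops[0 % bound]
        for i in range(1, n):
            # peeled loop: 'bound' is known to be loop-invariant
            total += ops[i % bound]
    return total
```

In the worst case (nothing is invariant) the peeled loop body is identical to the preamble, which is why runtime performance cannot get worse.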
 
@@ -972,7 +973,7 @@
 \subsection{Python}
 The Python interpreter of the RPython framework is a complete Python
 version 2.7 compatible interpreter. A set of numerical
-calculations were implemented in both Python and in C and their
+calculations were implemented in Python, C and Lua, and their
 runtimes are compared in Figure~\ref{fig:benchmarks}.\footnote{
     The benchmarks and the scripts to run them can be found in the repository for this paper:
     \texttt{https://bitbucket.org/pypy/extradoc/src/ tip/talk/dls2012/benchmarks}
@@ -980,30 +981,30 @@
 
 The benchmarks are
 \begin{itemize}
-\item {\bf sqrt}: approximates the square root of $y$. The approximation is 
+\item {\bf sqrt}$\left(T\right)$: approximates the square root of $y$. The approximation is 
 initialized to $x_0=y/2$ and the benchmark consists of a single loop updating this
 approximation using $x_i = \left( x_{i-1} + y/x_{i-1} \right) / 2$ for $1\leq i < 10^8$. 
 Only the latest calculated value $x_i$ is kept alive as a local variable within the loop.
 There are three different versions of this benchmark where $x_i$
-  is represented with different type of objects: int's, float's and
+  is represented with different types of objects, $T$: ints, floats and
   Fix16s. The latter, Fix16, is a custom class that implements
   fixpoint arithmetic with 16 bits precision. In Python there is only
   a single implementation of the benchmark that gets specialized
   depending on the class of its input argument, $y$, while in C,
   there are three different implementations.
-\item {\bf conv3}: one-dimensional convolution with fixed kernel-size $3$. A single loop
+\item {\bf conv3}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $3$. A single loop
 is used to calculate a vector ${\bf b} = \left(b_1, \cdots, b_n\right)$ from a vector
 ${\bf a} = \left(a_1, \cdots, a_n\right)$ and a kernel ${\bf k} = \left(k_1, k_2, k_3\right)$ using 
 $b_i = k_3 a_i + k_2 a_{i+1} + k_1 a_{i+2}$ for $1 \leq i \leq n$. Both the output vector, $\bf b$, 
 and the input vectors, $\bf a$ and $\bf k$, are allocated prior to running the benchmark. It is executed 
 with $n=10^5$ and $n=10^6$.
-\item {\bf conv5}: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with 
+\item {\bf conv5}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with 
 ${\bf k} = \left(k_1, k_2, k_3, k_4, k_5\right)$. The enumeration of the elements in $\bf k$ is still 
 hardcoded into the implementation making the benchmark consist of a single loop too.
-\item {\bf conv3x3}: two-dimensional convolution with kernel of fixed
+\item {\bf conv3x3}$\left(n,m\right)$: two-dimensional convolution with kernel of fixed
   size $3 \times 3$ using a custom class to represent two-dimensional
   arrays. It is implemented as two nested loops that iterate over the elements of the
-$n\times n$ output matrix ${\bf B} = \left(b_{i,j}\right)$ and calculates each element from the input matrix
+$m\times n$ output matrix ${\bf B} = \left(b_{i,j}\right)$ and calculates each element from the input matrix
 ${\bf A} = \left(a_{i,j}\right)$ and a kernel ${\bf K} = \left(k_{i,j}\right)$ using $b_{i,j} = $
 \begin{equation}
   \label{eq:convsum}
@@ -1013,14 +1014,15 @@
     k_{1,3} a_{i+1,j-1} &+& k_{1,2} a_{i+1,j} &+& k_{1,1} a_{i+1,j+1}  \\
   \end{array}
 \end{equation}
-for $1 \leq i \leq n$ and $1 \leq j \leq n$.
-The memory for storing the matrices are again allocated outside the benchmark and $n=1000$ was used.
-\item {\bf dilate3x3}: two-dimensional dilation with kernel of fixed
+for $1 \leq i \leq m$ and $1 \leq j \leq n$.
+The memory for storing the matrices is again allocated outside the benchmark and $(n,m)=(1000,1000)$
+as well as $(n,m)=(1000000,3)$ were used.
+\item {\bf dilate3x3}$\left(n\right)$: two-dimensional dilation with kernel of fixed
   size $3 \times 3$. This is similar to convolution but instead of
   summing over the terms in Equation~\ref{eq:convsum}, the maximum over those terms is taken. That places an
   external call to a max function within the loop that prevents some
   of the optimizations.
-\item {\bf sobel}: a low-level video processing algorithm used to
+\item {\bf sobel}$\left(n\right)$: a low-level video processing algorithm used to
   locate edges in an image. It calculates the gradient magnitude
   using Sobel derivatives. A Sobel x-derivative, $D_x$, of an $n \times n$ image, ${I}$, is formed
 by convolving ${I}$ with
@@ -1044,11 +1046,31 @@
 on top of a custom two-dimensional array class.
 It is
 a straightforward implementation providing two-dimensional
-indexing with out of bounds checks. For the C implementations it is
+indexing with out-of-bounds checks and
+data stored in row-major order.
+For the C implementations it is
 implemented as a C++ class. The other benchmarks are implemented in
 plain C. All the benchmarks except sqrt operate on C double-precision floating
 point numbers, both in the Python and the C code.
 
+In addition we ported the
+SciMark\footnote{\texttt{http://math.nist.gov/scimark2/}} benchmarks to Python and compared
+their runtimes with the already existing Lua and C implementations.
+This port was performed after the release of the PyPy used to run the benchmarks, which means that
+these benchmarks have not influenced the PyPy implementation.
+SciMark consists of 
+
+\begin{itemize}
+\item {\bf SOR}$\left(n, c\right)$: Jacobi successive over-relaxation on an $n\times n$ grid repeated $c$ times.
+The same custom two-dimensional array class as described above is used to represent
+the grid.
+\item {\bf SparseMatMult}$\left(n, z, c\right)$: Matrix multiplication between an $n\times n$ sparse matrix,
+stored in compressed-row format, and a full storage vector, stored in a normal array. The matrix has $z$ non-zero elements and the calculation is repeated $c$ times.
+\item {\bf MonteCarlo}$\left(n\right)$: Monte Carlo integration by generating $n$ points uniformly distributed over the unit square and computing the ratio of those within the unit circle.
+\item {\bf LU}$\left(n, c\right)$: LU factorization of an $n \times n$ matrix. The rows of the matrix are shuffled, which makes the previously used two-dimensional array class unsuitable. Instead, a list of arrays is used to represent the matrix. The calculation is repeated $c$ times.
+\item {\bf FFT}$\left(n, c\right)$: Fast Fourier Transform of a vector with $n$ elements, represented as an array, repeated $c$ times.
+\end{itemize}
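To make the MonteCarlo kernel concrete, here is a minimal Python sketch of the computation it performs (illustrative only: it uses Python's `random` module instead of SciMark's own linear congruential generator, so the exact numbers differ from the benchmark):

```python
import random

def monte_carlo_integrate(num_samples, seed=113):
    # Estimate pi/4 as the fraction of uniformly random points in the
    # unit square that fall inside the quarter unit circle.
    rng = random.Random(seed)
    under_curve = 0
    for _ in range(num_samples):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            under_curve += 1
    return under_curve / float(num_samples)

# 4 * monte_carlo_integrate(10**5) approximates pi
```

The loop body is short, numerical and allocation-free, which is the profile loop peeling targets.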
+
 Benchmarks were run on Intel i7 M620 @2.67GHz with 4M cache and 8G of RAM
 using Ubuntu Linux 11.4 in 32bit mode.
 The machine was otherwise unoccupied. We use the following software
@@ -1064,6 +1086,10 @@
 We run GCC with -O3 -march=native, disabling the
 automatic loop vectorization. In all cases, SSE2 instructions were used for
 floating point operations, except Psyco which uses x87 FPU instructions.
+% Psyco does not use the x87 FPU: all floating-point arithmetic is done with
+% residual calls to C helpers.  These can probably be compiled with SSE2.
+% But compiling CPython (and maybe Psyco) for x87 or SSE2 has probably
+% no measurable effect.
 We also run PyPy with loop peeling optimization and without (but otherwise
 identical).
 
diff --git a/talk/iwtc11/benchmarks/benchmark.sh b/talk/iwtc11/benchmarks/benchmark.sh
--- a/talk/iwtc11/benchmarks/benchmark.sh
+++ b/talk/iwtc11/benchmarks/benchmark.sh
@@ -23,10 +23,30 @@
     ./runner.py -n 5 -c "$*" scimark/run_MonteCarlo.c 268435456
     ./runner.py -n 5 -c "$*" scimark/run_LU.c 100 4096
     ./runner.py -n 5 -c "$*" scimark/run_LU.c 1000 2
+    ./runner.py -n 5 -c "$* -lm" scimark/run_FFT.c 1024 32768
+    ./runner.py -n 5 -c "$* -lm" scimark/run_FFT.c 1048576 2
     rm a.out
 elif [[ "$1" == luajit* ]]; then
+    $* runner.lua sqrt int
+    $* runner.lua sqrt float
+    $* runner.lua sqrt Fix16
+    $* runner.lua convolution conv3 100
+    $* runner.lua convolution conv5 100
+    $* runner.lua convolution conv3 1000
+    $* runner.lua convolution conv5 1000
+    $* runner.lua convolution conv3x3 1000000 3
+    $* runner.lua convolution conv3x3 1000 1000
+    $* runner.lua convolution dilate3x3 1000 1000
+    $* runner.lua convolution sobel_magnitude 1000 1000
     $* runner.lua SOR 100 32768
     $* runner.lua SOR 1000 256
+    $* runner.lua SparseMatMult 1000 5000 262144
+    $* runner.lua SparseMatMult 100000 1000000 1024
+    $* runner.lua MonteCarlo 268435456
+    $* runner.lua LU 100 4096
+    $* runner.lua LU 1000 2
+    $* runner.lua FFT 1024 32768
+    $* runner.lua FFT 1048576 2
 else
     if [ "$1" == "python2.7" ]; then
         EXTRA_OPTS='-w 0 -n 1'
@@ -57,11 +77,13 @@
     #$* ./runner.py $EXTRA_OPTS image/sobel.py main NoBorderImagePadded uint8
     $* ./runner.py $EXTRA_OPTS scimark.py SOR 100 32768 Array2D
     $* ./runner.py $EXTRA_OPTS scimark.py SOR 1000 256 Array2D
-    $* ./runner.py $EXTRA_OPTS scimark.py SOR 100 32768 ArrayList
-    $* ./runner.py $EXTRA_OPTS scimark.py SOR 1000 256 ArrayList
+    #$* ./runner.py $EXTRA_OPTS scimark.py SOR 100 32768 ArrayList
+    #$* ./runner.py $EXTRA_OPTS scimark.py SOR 1000 256 ArrayList
     $* ./runner.py $EXTRA_OPTS scimark.py SparseMatMult 1000 5000 262144
     $* ./runner.py $EXTRA_OPTS scimark.py SparseMatMult 100000 1000000 1024
     $* ./runner.py $EXTRA_OPTS scimark.py MonteCarlo 268435456
     $* ./runner.py $EXTRA_OPTS scimark.py LU 100 4096
     $* ./runner.py $EXTRA_OPTS scimark.py LU 1000 2
+    $* ./runner.py $EXTRA_OPTS scimark.py FFT 1024 32768
+    $* ./runner.py $EXTRA_OPTS scimark.py FFT 1048576 2
 fi
diff --git a/talk/iwtc11/benchmarks/convolution/convolution.lua b/talk/iwtc11/benchmarks/convolution/convolution.lua
--- a/talk/iwtc11/benchmarks/convolution/convolution.lua
+++ b/talk/iwtc11/benchmarks/convolution/convolution.lua
@@ -1,3 +1,4 @@
+module(..., package.seeall);
 local ffi = require("ffi")
 
 function array(length, initializer)
@@ -174,5 +175,5 @@
     return string.format("%s", arg)
 end
 
-main(arg)
+--main(arg)
 
diff --git a/talk/iwtc11/benchmarks/result.txt b/talk/iwtc11/benchmarks/result.txt
--- a/talk/iwtc11/benchmarks/result.txt
+++ b/talk/iwtc11/benchmarks/result.txt
@@ -1,129 +1,189 @@
 
 pypy
-sqrt(float):   1.20290899277
-  sqrt(int):   2.41840982437
-sqrt(Fix16):   6.10620713234
-conv3(1e8):    2.5192759037
-conv5(1e8):    2.89429306984
-conv3(1e6):    0.828789949417
-conv5(1e6):    1.01669406891
-conv3(1e5):    0.777491092682
-conv5(1e5):    0.971807956696
-conv3x3(3):    0.653658866882
-conv3x3(1000): 0.748742103577
-dilate3x3(1000): 4.8826611042
-NoBorderImagePadded: 2.31043601036
-NoBorderImagePadded(iter): 0.572638988495
-NoBorderImagePadded(range): 0.494098186493
-NoBorderImage: 2.90333104134
-NoBorderImage(iter): 2.06943392754
-NoBorderImage(range): 1.99161696434
-sobel(NoBorderImagePadded): 0.668392896652
+sqrt(int): 3.9497149229 +- 0.00120169176702
+sqrt(float): 1.18568074703 +- 0.000155574177096
+sqrt(Fix16): 4.33989310265 +- 0.00141233338935
+conv3(array(1e6)): 0.509183955193 +- 0.0118453357313
+conv5(array(1e6)): 0.69121158123 +- 0.00750138546764
+conv3(array(1e5)): 0.4399548769 +- 0.00179808936191
+conv5(array(1e5)): 0.641533112526 +- 0.00283121562299
+conv3x3(Array2D(1000000x3)): 0.32311899662 +- 0.00297940582696
+conv3x3(Array2D(1000x1000)): 0.294556212425 +- 0.00394363604342
+dilate3x3(Array2D(1000x1000)): 5.62028222084 +- 0.0100742850395
+sobel(Array2D(1000x1000)): 0.353349781036 +- 0.000422230713013
+SOR(100, 32768): 3.6967458725 +- 0.00479411350316
+SOR(1000, 256): 2.92602846622 +- 0.00460152567878
+SOR(100, 32768): 5.91232867241 +- 0.0575417343725
+SOR(1000, 256): 4.48931508064 +- 0.0545822457385
+SparseMatMult(1000, 5000, 262144): 45.573383832 +- 0.628020354674
+SparseMatMult(100000, 1000000, 1024): 31.8840100527 +- 0.0835424264131
+MonteCarlo(268435456): 18.0108832598 +- 0.0590538416431
+LU(100, 4096): 17.11741395 +- 0.146651016873
+LU(1000, 2): 8.36587500572 +- 0.0643368943091
 
-pypy --jit enable_opts=intbounds:rewrite:virtualize:heap:unroll
-sqrt(float):   1.19338798523
-  sqrt(int):   2.42711806297
-sqrt(Fix16):   6.12403416634
-conv3(1e8):    2.06937193871
-conv5(1e8):    2.26879811287
-conv3(1e6):    0.837247848511
-conv5(1e6):    1.02573990822
-conv3(1e5):    0.779927015305
-conv5(1e5):    0.975258827209
-conv3x3(3):    0.663229942322
-conv3x3(1000): 0.763913154602
-dilate3x3(1000): 4.80735611916
-NoBorderImagePadded: 2.33380198479
-NoBorderImagePadded(iter): 0.504709005356
-NoBorderImagePadded(range): 0.503198862076
-NoBorderImage: 2.93766593933
-NoBorderImage(iter): 2.04195189476
-NoBorderImage(range): 2.02779984474
-sobel(NoBorderImagePadded): 0.670017004013
+pypy --jit enable_opts=intbounds:rewrite:virtualize:string:earlyforce:pure:heap:ffi
+sqrt(int): 5.38412702084 +- 0.0100677718267
+sqrt(float): 2.49882881641 +- 0.000611829128708
+sqrt(Fix16): 9.08926799297 +- 0.00638996685205
+conv3(array(1e6)): 2.07706921101 +- 0.0578137268002
+conv5(array(1e6)): 2.29385373592 +- 0.239051363255
+conv3(array(1e5)): 1.9695744276 +- 0.00699373341986
+conv5(array(1e5)): 2.06334021091 +- 0.00461312422073
+conv3x3(Array2D(1000000x3)): 0.913360571861 +- 0.00406856919645
+conv3x3(Array2D(1000x1000)): 0.906745815277 +- 0.011800811341
+dilate3x3(Array2D(1000x1000)): 5.94119987488 +- 0.0177689080267
+sobel(Array2D(1000x1000)): 0.879287624359 +- 0.00351199656947
+SOR(100, 32768): 13.3457442522 +- 0.15597493782
+SOR(1000, 256): 10.6485268593 +- 0.0335292228831
+SOR(100, 32768): 15.2722632885 +- 0.149270948773
+SOR(1000, 256): 12.2542063951 +- 0.0467913588079
+SparseMatMult(1000, 5000, 262144): 51.7010503292 +- 0.0900830635215
+SparseMatMult(100000, 1000000, 1024): 34.0754101276 +- 0.0854521241748
+MonteCarlo(268435456): 27.4164168119 +- 0.00974970184296
+LU(100, 4096): 48.2948143244 +- 0.509639206256
+LU(1000, 2): 24.4584824085 +- 0.0807806236077
 
-pypy --jit enable_opts=intbounds:rewrite:virtualize:heap
-sqrt(float):   1.69957995415
-  sqrt(int):   3.13235807419
-sqrt(Fix16):   10.325592041
-conv3(1e8):    2.997631073
-conv5(1e8):    3.13820099831
-conv3(1e6):    1.7843170166
-conv5(1e6):    1.94643998146
-conv3(1e5):    1.75876712799
-conv5(1e5):    1.96709895134
-conv3x3(3):    1.09958791733
-conv3x3(1000): 1.02993702888
-dilate3x3(1000): 5.22873902321
-NoBorderImagePadded: 2.45174002647
-NoBorderImagePadded(iter): 1.60747289658
-NoBorderImagePadded(range): 1.55282211304
-NoBorderImage: 2.91020989418
-NoBorderImage(iter): 1.97922706604
-NoBorderImage(range): 2.14161992073
-sobel(NoBorderImagePadded): 1.47591900826
+pypy-1.5
+sqrt(int): 4.01375324726 +- 0.0011476694851
+sqrt(float): 1.18687217236 +- 0.000301798978394
+sqrt(Fix16): 4.86933817863 +- 0.00205854686543
+conv3(array(1e6)): 0.805051374435 +- 0.0063356172758
+conv5(array(1e6)): 1.06881151199 +- 0.166557589133
+conv3(array(1e5)): 0.767954874039 +- 0.00310620949945
+conv5(array(1e5)): 0.965079665184 +- 0.000806628058215
+conv3x3(Array2D(1000000x3)): 0.335144019127 +- 0.00049856745349
+conv3x3(Array2D(1000x1000)): 0.29465200901 +- 0.000517387744409
+dilate3x3(Array2D(1000x1000)): 4.75037336349 +- 0.0580217877578
+sobel(Array2D(1000x1000)): 0.663321614265 +- 0.122793251782
+SOR(100, 32768): 4.81084053516 +- 0.00994169505717
+SOR(1000, 256): 3.69062592983 +- 0.000879615350989
+SparseMatMult(1000, 5000, 262144): 29.4872629166 +- 0.10046773485
+SparseMatMult(100000, 1000000, 1024): 16.4197937727 +- 0.0719696247072
+MonteCarlo(268435456): 33.0701499462 +- 0.0638672466435
 
-gcc
-sqrt(float):   1.43
-sqrt(int):     1.93
-sqrt(Fix16):   2.04
-conv3(1e8):     2.03
-conv5(1e8):     2.39
-conv3(1e6):     1.66
-conv5(1e6):     2.03
-conv3(1e5):     1.60
-conv5(1e5):     2.02
-conv3x3(3):  1.81
-conv3x3(1000):  1.79
-dilate3x3(1000):  3.26
-sobel_magnitude:  1.37
+pypy-1.5 --jit enable_opts=intbounds:rewrite:virtualize:heap
+sqrt(int): 4.90680310726 +- 0.0163989281435
+sqrt(float): 1.76404910088 +- 0.019897073087
+sqrt(Fix16): 9.64484581947 +- 0.114181653484
+conv3(array(1e6)): 2.09028859138 +- 0.0553368910699
+conv5(array(1e6)): 1.98986980915 +- 0.0147589410577
+conv3(array(1e5)): 2.03130574226 +- 0.0153185288294
+conv5(array(1e5)): 1.95361895561 +- 0.00846210060946
+conv3x3(Array2D(1000000x3)): 0.771404409409 +- 0.00438046479707
+conv3x3(Array2D(1000x1000)): 0.724743962288 +- 0.00330094765836
+dilate3x3(Array2D(1000x1000)): 4.96963682175 +- 0.00698590266664
+sobel(Array2D(1000x1000)): 1.63008458614 +- 1.3629432655
+SOR(100, 32768): 13.871041584 +- 0.0322488434431
+SOR(1000, 256): 11.9500208616 +- 0.00961527429654
+SparseMatMult(1000, 5000, 262144): 37.7395636082 +- 0.108390387625
+SparseMatMult(100000, 1000000, 1024): 27.7381374121 +- 0.105548816891
+MonteCarlo(268435456): 30.6472777128 +- 0.0437974003055
 
-gcc -O2
-sqrt(float):   1.15
-sqrt(int):     1.86
-sqrt(Fix16):   1.89
-conv3(1e8):     1.22
-conv5(1e8):     1.37
-conv3(1e6):     1.00
-conv5(1e6):     1.04
-conv3(1e5):     0.81
-conv5(1e5):     0.97
-conv3x3(3):  0.25
-conv3x3(1000):  0.23
-dilate3x3(1000):  0.27
-sobel_magnitude:  0.25
-
-gcc -O3 -march=native
-sqrt(float):   1.15
-sqrt(int):     1.82
-sqrt(Fix16):   1.89
-conv3(1e8):     1.12
-conv5(1e8):     1.16
-conv3(1e6):     0.96
-conv5(1e6):     0.97
-conv3(1e5):     0.66
-conv5(1e5):     0.75
-conv3x3(3):  0.23
-conv3x3(1000):  0.21
-dilate3x3(1000):  0.26
-sobel_magnitude:  0.25
+gcc -O3 -march=native -fno-tree-vectorize
+sqrt(float): 1.14 +- 0.0
+sqrt(int): 1.85 +- 0.0
+sqrt(Fix16): 1.992 +- 0.004472135955
+conv3(1e6): 1.066 +- 0.00547722557505
+conv5(1e6): 1.104 +- 0.00547722557505
+conv3(1e5): 0.75 +- 0.0
+conv5(1e5): 1.03 +- 0.0
+conv3x3(3): 0.22 +- 3.10316769156e-17
+conv3x3(1000): 0.2 +- 0.0
+dilate3x3(1000): 0.2 +- 0.0
+SOR(100,32768): 2.506 +- 0.00547722557505
+SOR(1000,256): 2.072 +- 0.004472135955
+SparseMatMult(1000,5000,262144): 2.54 +- 0.0
+SparseMatMult(100000,1000000,1024): 2.398 +- 0.004472135955
+MonteCarlo(268435456): 2.52 +- 0.0
+LU(100,4096): 1.882 +- 0.004472135955
+LU(1000,2): 2.036 +- 0.00547722557505
 
 python2.7
-sqrt(float):   34.9008591175
-  sqrt(int):   19.6919620037
-sqrt(Fix16):   966.111785889
-conv3(1e8):    69.0758299828
-conv5(1e8):    101.503945827
-conv3(1e6):    62.212736845
-conv5(1e6):    93.5375850201
-conv3(1e5):    61.4343979359
-conv5(1e5):    93.6144771576
-conv3x3(3):    198.12590003
-conv3x3(1000): 193.030704975
-dilate3x3(1000): 192.323596954
-NoBorderImagePadded: 512.473811865
-NoBorderImagePadded(iter): 503.393321991
-NoBorderImagePadded(range): 493.907886028
-NoBorderImage: 501.37309289
-NoBorderImage(iter): 495.473101139
-NoBorderImage(range): 493.572232008
-sobel(NoBorderImagePadded): 433.678281069
+sqrt(int): 15.5302910805
+sqrt(float): 19.8081839085
+sqrt(Fix16): 690.281599045
+conv3(array(1e6)): 58.9430649281
+conv5(array(1e6)): 88.9902608395
+conv3(array(1e5)): 60.0520131588
+conv5(array(1e5)): 88.7499320507
+conv3x3(Array2D(1000000x3)): 182.564875841
+conv3x3(Array2D(1000x1000)): 179.802839994
+dilate3x3(Array2D(1000x1000)): 177.197051048
+sobel(Array2D(1000x1000)): 132.991428852
+SOR(100, 32768): 1854.50835085
+SOR(1000, 256): 1506.28460383
+SOR(100, 32768): 1279.75841594
+SOR(1000, 256): 1038.63221002
+SparseMatMult(1000, 5000, 262144): 456.105548859
+SparseMatMult(100000, 1000000, 1024): 272.003329039
+MonteCarlo(268435456): 800.114681005
+LU(100, 4096): 2704.15891314
+LU(1000, 2): 1317.06345105
+
+python2.6 psyco-wrapper.py
+
+luajit-2.0.0-beta10
+sqrt(int): 1.185000 +- 0.005270
+sqrt(float): 1.185000 +- 0.005270
+sqrt(Fix16): 106.936000 +- 0.350213
+convolution(conv3): 0.476000 +- 0.005164
+convolution(conv5): 0.478000 +- 0.012293
+convolution(conv3): 0.172000 +- 0.006325
+convolution(conv5): 0.286000 +- 0.005164
+convolution(conv3x3): 0.207000 +- 0.004830
+convolution(conv3x3): 0.167000 +- 0.006749
+convolution(dilate3x3): 0.165000 +- 0.005270
+convolution(sobel_magnitude): 0.398000 +- 0.006325
+SOR(100, 32768): 2.186000 +- 0.005164
+SOR(1000, 256): 1.797000 +- 0.006749
+SparseMatMult(1000,5000,262144): 6.642000 +- 0.049621
+SparseMatMult(100000,1000000,1024): 3.846000 +- 0.023664
+MonteCarlo(268435456): 4.082000 +- 0.004216
+LU(100, 4096): 2.371000 +- 0.019120
+LU(1000, 2): 2.141000 +- 0.037550
+FFT(1024, 32768): 3.900000 +- 0.010541
+FFT(1048576, 2): 2.815000 +- 0.142848
+
+luajit-2.0.0-beta10 -O-loop
+sqrt(int): 1.462000 +- 0.004216
+sqrt(float): 1.462000 +- 0.004216
+sqrt(Fix16): 102.775000 +- 0.332106
+convolution(conv3): 0.950000 +- 0.006667
+convolution(conv5): 1.219000 +- 0.077093
+convolution(conv3): 0.894000 +- 0.005164
+convolution(conv5): 1.150000 +- 0.004714
+convolution(conv3x3): 0.734000 +- 0.005164
+convolution(conv3x3): 0.691000 +- 0.007379
+convolution(dilate3x3): 0.710000 +- 0.012472
+convolution(sobel_magnitude): 0.833000 +- 0.009487
+SOR(100, 32768): 2.727000 +- 0.004830
+SOR(1000, 256): 2.264000 +- 0.005164
+SparseMatMult(1000,5000,262144): 13.485000 +- 0.235384
+SparseMatMult(100000,1000000,1024): 10.869000 +- 0.014491
+MonteCarlo(268435456): 5.943000 +- 0.006749
+LU(100, 4096): 11.064000 +- 0.019551
+LU(1000, 2): 5.109000 +- 0.005676
+FFT(1024, 32768): 5.999000 +- 0.007379
+FFT(1048576, 2): 2.997000 +- 0.137602
+
+luajit-master
+sqrt(int): 1.185000 +- 0.005270
+sqrt(float): 1.185000 +- 0.005270
+sqrt(Fix16): 1.739000 +- 0.003162
+convolution(conv3): 0.477000 +- 0.008233
+convolution(conv5): 0.474000 +- 0.005164
+convolution(conv3): 0.165000 +- 0.005270
+convolution(conv5): 0.286000 +- 0.005164
+convolution(conv3x3): 0.207000 +- 0.004830
+convolution(conv3x3): 0.167000 +- 0.006749
+convolution(dilate3x3): 0.163000 +- 0.006749
+convolution(sobel_magnitude): 0.403000 +- 0.009487
+SOR(100, 32768): 2.187000 +- 0.006749
+SOR(1000, 256): 1.802000 +- 0.006325
+SparseMatMult(1000,5000,262144): 6.683000 +- 0.029833
+SparseMatMult(100000,1000000,1024): 3.870000 +- 0.037712
+MonteCarlo(268435456): 4.035000 +- 0.005270
+LU(100, 4096): 2.351000 +- 0.008756
+LU(1000, 2): 2.107000 +- 0.018288
+FFT(1024, 32768): 3.926000 +- 0.010750
+FFT(1048576, 2): 2.865000 +- 0.064334
diff --git a/talk/iwtc11/benchmarks/runall.sh b/talk/iwtc11/benchmarks/runall.sh
--- a/talk/iwtc11/benchmarks/runall.sh
+++ b/talk/iwtc11/benchmarks/runall.sh
@@ -10,6 +10,8 @@
 ./benchmark.sh gcc -O3 -march=native -fno-tree-vectorize
 ./benchmark.sh python2.7
 ./benchmark.sh python2.6 psyco-wrapper.py
-./benchmark.sh luajit-2.0.0-beta10
-./benchmark.sh luajit-2.0.0-beta10 -O-loop
-./benchmakr.sh luajit
+#./benchmark.sh luajit-2.0.0-beta10
+#./benchmark.sh luajit-2.0.0-beta10 -O-loop
+./benchmark.sh luajit-master
+./benchmark.sh luajit-master -O-loop
+#./benchmark.sh luajit
diff --git a/talk/iwtc11/benchmarks/runner.lua b/talk/iwtc11/benchmarks/runner.lua
--- a/talk/iwtc11/benchmarks/runner.lua
+++ b/talk/iwtc11/benchmarks/runner.lua
@@ -6,11 +6,50 @@
 
 function benchmarks.SOR(n, cycles)
     n, cycles = tonumber(n), tonumber(cycles)
-    local mat = scimark.random_matrix(n, n)
-    scimark.sor_run(mat, n, n, cycles, 1.25)
+    scimark.benchmarks.SOR(n)(cycles)
     return string.format('SOR(%d, %d)', n, cycles)
 end
 
+function benchmarks.SparseMatMult(n, nz, cycles)
+    n, nz, cycles = tonumber(n), tonumber(nz), tonumber(cycles)
+    scimark.benchmarks.SPARSE(n, nz)(cycles)
+    return string.format('SparseMatMult(%d,%d,%d)', n, nz, cycles)
+end
+
+function benchmarks.MonteCarlo(cycles)
+    cycles = tonumber(cycles)
+    scimark.benchmarks.MC()(cycles)
+    return string.format('MonteCarlo(%d)', cycles)
+end
+
+function benchmarks.LU(n, cycles)
+    n, cycles = tonumber(n), tonumber(cycles)
+    scimark.benchmarks.LU(n)(cycles)
+    return string.format('LU(%d, %d)', n, cycles)
+end
+
+function benchmarks.FFT(n, cycles)
+    n, cycles = tonumber(n), tonumber(cycles)
+    scimark.benchmarks.FFT(n)(cycles)
+    return string.format('FFT(%d, %d)', n, cycles)
+end
+
+package.path = package.path .. ";sqrt/?.lua"
+require('sqrt')
+function benchmarks.sqrt(a)
+    return string.format('sqrt(%s)', sqrt.main({a}))
+end
+
+package.path = package.path .. ";convolution/?.lua"
+require('convolution')
+function benchmarks.convolution(a, b, c)
+    convolution.main({a, b, c})
+    return string.format('%s(%s, %s)', a, b, tostring(c))
+end
+
+
+
+
 function measure(name, ...)
     scimark.array_init()
     scimark.rand_init(101009)
diff --git a/talk/iwtc11/benchmarks/scimark.lua b/talk/iwtc11/benchmarks/scimark.lua
--- a/talk/iwtc11/benchmarks/scimark.lua
+++ b/talk/iwtc11/benchmarks/scimark.lua
@@ -37,7 +37,7 @@
 local RANDOM_SEED = 101009 -- Must be odd.
 local SIZE_SELECT = "small"
 
-local benchmarks = {
+benchmarks = {
   "FFT", "SOR", "MC", "SPARSE", "LU",
   small = {
     FFT		= { 1024 },
@@ -213,7 +213,7 @@
 -- SOR: Jacobi Successive Over-Relaxation.
 ------------------------------------------------------------------------------
 
-function sor_run(mat, m, n, cycles, omega)
+local function sor_run(mat, m, n, cycles, omega)
   local om4, om1 = omega*0.25, 1.0-omega
   m = m - 1
   n = n - 1
diff --git a/talk/iwtc11/benchmarks/scimark.py b/talk/iwtc11/benchmarks/scimark.py
--- a/talk/iwtc11/benchmarks/scimark.py
+++ b/talk/iwtc11/benchmarks/scimark.py
@@ -1,5 +1,6 @@
 from convolution.convolution import Array2D
 from array import array
+import math
 
 class Random(object):
     MDIG = 32
@@ -64,6 +65,10 @@
             a[x, y] = self.nextDouble()
         return a
 
+    def RandomVector(self, n):
+        return array('d', [self.nextDouble() for i in xrange(n)])
+    
+
 class ArrayList(Array2D):
     def __init__(self, w, h, data=None):
         self.width = w
@@ -185,3 +190,106 @@
         lu.copy_data_from(A)
         LU_factor(lu, pivot)
     return 'LU(%d, %d)' % (N, cycles)
+
+def int_log2(n):
+    k = 1
+    log = 0
+    while k < n:
+        k *= 2
+        log += 1
+    if n != 1 << log:
+        raise Exception("FFT: Data length is not a power of 2: %s" % n)
+    return log
+
+def FFT_num_flops(N):
+    return (5.0 * N - 2) * int_log2(N) + 2 * (N + 1)
+
+def FFT_transform_internal(N, data, direction):
+    n = N / 2
+    bit = 0
+    dual = 1
+    if n == 1:
+        return
+    logn = int_log2(n)
+    if N == 0:
+        return
+    FFT_bitreverse(N, data)
+
+    # apply fft recursion
+    # this loop executed int_log2(N) times
+    bit = 0
+    while bit < logn:
+        w_real = 1.0
+        w_imag = 0.0
+        theta = 2.0 * direction * math.pi / (2.0 * float(dual))
+        s = math.sin(theta)
+        t = math.sin(theta / 2.0)
+        s2 = 2.0 * t * t
+        for b in range(0, n, 2 * dual):
+            i = 2 * b
+            j = 2 * (b + dual)
+            wd_real = data[j]
+            wd_imag = data[j + 1]
+            data[j] = data[i] - wd_real
+            data[j + 1] = data[i + 1] - wd_imag
+            data[i] += wd_real
+            data[i + 1] += wd_imag
+        for a in xrange(1, dual):
+            tmp_real = w_real - s * w_imag - s2 * w_real
+            tmp_imag = w_imag + s * w_real - s2 * w_imag
+            w_real = tmp_real
+            w_imag = tmp_imag
+            for b in range(0, n, 2 * dual):
+                i = 2 * (b + a)
+                j = 2 * (b + a + dual)
+                z1_real = data[j]
+                z1_imag = data[j + 1]
+                wd_real = w_real * z1_real - w_imag * z1_imag
+                wd_imag = w_real * z1_imag + w_imag * z1_real
+                data[j] = data[i] - wd_real
+                data[j + 1] = data[i + 1] - wd_imag
+                data[i] += wd_real
+                data[i + 1] += wd_imag
+        bit += 1
+        dual *= 2
+
+def FFT_bitreverse(N, data):
+    n = N / 2
+    nm1 = n - 1
+    j = 0
+    for i in range(nm1):
+        ii = i << 1
+        jj = j << 1
+        k = n >> 1
+        if i < j:
+            tmp_real = data[ii]
+            tmp_imag = data[ii + 1]
+            data[ii] = data[jj]
+            data[ii + 1] = data[jj + 1]
+            data[jj] = tmp_real
+            data[jj + 1] = tmp_imag
+        while k <= j:
+            j -= k
+            k >>= 1
+        j += k
+
+def FFT_transform(N, data):
+    FFT_transform_internal(N, data, -1)
+
+def FFT_inverse(N, data):
+    n = N/2
+    norm = 0.0
+    FFT_transform_internal(N, data, +1)
+    norm = 1 / float(n)
+    for i in xrange(N):
+        data[i] *= norm
+
+def FFT(args):
+    N, cycles = map(int, args)
+    twoN = 2*N
+    x = Random(7).RandomVector(twoN)
+    for i in xrange(cycles):
+        FFT_transform(twoN, x)
+        FFT_inverse(twoN, x)
+    return 'FFT(%d, %d)' % (N, cycles)
+
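The correctness of the FFT port above rests on FFT_transform and FFT_inverse being mutually inverse, which is the property test_scimark.py checks against the C version. As a standalone illustration of that round-trip property, here is a minimal recursive radix-2 FFT sketch (independent of the scimark code; assumes the input length is a power of two):

```python
import cmath

def fft(vec):
    # Minimal recursive radix-2 (Cooley-Tukey) FFT on complex values.
    n = len(vec)
    if n == 1:
        return list(vec)
    even = fft(vec[0::2])   # FFT of even-indexed elements
    odd = fft(vec[1::2])    # FFT of odd-indexed elements
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(vec):
    # Inverse via conjugation: ifft(x) = conj(fft(conj(x))) / n
    n = len(vec)
    res = fft([v.conjugate() for v in vec])
    return [v.conjugate() / n for v in res]

# Round trip: ifft(fft(x)) recovers x up to floating-point rounding.
```

The scimark code stores complex data as interleaved real/imaginary doubles in a flat array rather than complex objects, which is why its vectors have length 2N.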
diff --git a/talk/iwtc11/benchmarks/scimark/kernel.c b/talk/iwtc11/benchmarks/scimark/kernel.c
--- a/talk/iwtc11/benchmarks/scimark/kernel.c
+++ b/talk/iwtc11/benchmarks/scimark/kernel.c
@@ -37,6 +37,7 @@
             cycles *= 2;
 
         }
+        printf("FFT: N=%d, cycles=%d\n", N, cycles);
         /* approx Mflops */
 
         result = FFT_num_flops(N)*cycles/ Stopwatch_read(Q) * 1.0e-6;
diff --git a/talk/iwtc11/benchmarks/scimark/run_FFT.c b/talk/iwtc11/benchmarks/scimark/run_FFT.c
new file mode 100644
--- /dev/null
+++ b/talk/iwtc11/benchmarks/scimark/run_FFT.c
@@ -0,0 +1,27 @@
+#include <stdio.h>
+#include <assert.h>
+
+#include "Random.c"
+#include "FFT.c"
+
+int main(int ac, char **av) {
+    assert(ac==3);
+    int N = atoi(av[1]);
+    int cycles = atoi(av[2]);
+    int twoN = 2*N;
+    Random R = new_Random_seed(7);
+    double *x = RandomVector(twoN, R);
+    int i=0;
+
+    for (i=0; i<cycles; i++)
+    {
+        FFT_transform(twoN, x);     /* forward transform */
+        FFT_inverse(twoN, x);       /* backward transform */
+    }
+
+
+    fprintf(stderr, "FFT(%d,%d):    ", N, cycles);
+    return 0;
+}
+
+
diff --git a/talk/iwtc11/benchmarks/sqrt/sqrt.lua b/talk/iwtc11/benchmarks/sqrt/sqrt.lua
--- a/talk/iwtc11/benchmarks/sqrt/sqrt.lua
+++ b/talk/iwtc11/benchmarks/sqrt/sqrt.lua
@@ -1,3 +1,5 @@
+module(..., package.seeall);
+
 local bit = require("bit")
 local lshift, rshift, tobit = bit.lshift, bit.rshift, bit.tobit
 
@@ -103,4 +105,4 @@
     return string.format("%s", arg)
 end
 
-main(arg)
+--main(arg)
diff --git a/talk/iwtc11/benchmarks/test_scimark.py b/talk/iwtc11/benchmarks/test_scimark.py
--- a/talk/iwtc11/benchmarks/test_scimark.py
+++ b/talk/iwtc11/benchmarks/test_scimark.py
@@ -1,4 +1,5 @@
-from scimark import SOR_execute, Array2D, ArrayList, Random, MonteCarlo_integrate, LU_factor
+from scimark import SOR_execute, Array2D, ArrayList, Random, MonteCarlo_integrate, LU_factor, \
+        FFT_transform, FFT_inverse
 from array import array
 from cffi import FFI
 import os
@@ -9,21 +10,25 @@
     Random new_Random_seed(int seed);
     double Random_nextDouble(Random R);
     double **RandomMatrix(int M, int N, Random R);
+    double *RandomVector(int N, Random R);
 
     void SOR_execute(int M, int N,double omega, double **G, int num_iterations);
     double MonteCarlo_integrate(int Num_samples);    
     int LU_factor(int M, int N, double **A,  int *pivot);
+    void FFT_transform(int N, double *data);
+    void FFT_inverse(int N, double *data);
     """)
 C = ffi.verify("""
     #include <SOR.h>
     #include <Random.h>
     #include <MonteCarlo.h>
     #include <LU.h>
+    #include <FFT.h>
     """, 
     extra_compile_args=['-I' + os.path.join(os.getcwd(), 'scimark')],
     extra_link_args=['-fPIC'],
     extra_objects=[os.path.join(os.getcwd(), 'scimark', f) 
-                   for f in ['SOR.c', 'Random.c', 'MonteCarlo.c', 'LU.c']])
+                   for f in ['SOR.c', 'Random.c', 'MonteCarlo.c', 'LU.c', 'FFT.c']])
 
 class TestWithArray2D(object):
     Array = Array2D
@@ -82,4 +87,20 @@
     for n in [100, 200, 500, 1000]:
         assert C.MonteCarlo_integrate(n) == MonteCarlo_integrate(n)
 
+def test_fft():
+    rnd = C.new_Random_seed(7)
+    for n in [256, 512, 1024]:
+        data_c = C.RandomVector(n, rnd)
+        data_py = array('d', [0.0]) * n
+        for i in range(n):
+            data_py[i] = data_c[i]
+        C.FFT_transform(n, data_c)
+        FFT_transform(n, data_py)
+        for i in xrange(n):
+            assert data_py[i] == data_c[i]
+        C.FFT_inverse(n, data_c)
+        FFT_inverse(n, data_py)
+        for i in xrange(n):
+            assert data_py[i] == data_c[i]
 
+
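The new `test_fft` relies on exact float equality between the C and the Python results, which only holds when both sides perform bit-identical operations in the same order. A minimal pure-Python sketch of the copy-then-compare pattern used above (`transform` here is a hypothetical stand-in for `FFT_transform`, not the real routine):

```python
from array import array

def transform(data):
    # hypothetical stand-in for FFT_transform: any deterministic
    # float computation applied identically on both sides
    for i in range(len(data)):
        data[i] = data[i] * 0.5 + 1.0

n = 8
data_c = array('d', [float(i) for i in range(n)])   # plays the role of the C vector
data_py = array('d', [0.0]) * n                     # n zeros, as in the test above
for i in range(n):
    data_py[i] = data_c[i]                          # element-wise copy

transform(data_c)
transform(data_py)
assert all(data_py[i] == data_c[i] for i in range(n))
```

The same-order requirement is why the test links the Python and C versions against identical source kernels rather than comparing against a library FFT.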
diff --git a/talk/vmil2012/paper.tex b/talk/vmil2012/paper.tex
--- a/talk/vmil2012/paper.tex
+++ b/talk/vmil2012/paper.tex
@@ -110,7 +110,7 @@
 \begin{abstract}
 Tracing just-in-time (JIT) compilers record linear control flow paths,
 inserting operations called guards at points of possible divergence. These
-operations occur frequently generated traces and therefore it is important to
+operations occur frequently in generated traces and therefore it is important to
 design and implement them carefully to find the right trade-off between
 execution speed, deoptimization,
 and memory overhead.  In this paper we describe the design decisions about
@@ -121,6 +121,7 @@
 
 
 %___________________________________________________________________________
+\todo{better formatting for lstinline}
 \section{Introduction}
 
 Tracing just-in-time (JIT) compilers record and compile commonly executed
@@ -133,7 +134,7 @@
 This is done in the context of the RPython language and the PyPy project, which
 provides a tracing JIT compiler geared at dynamic language optimization.
 
-Our aim is to help understand the design constraints when implementing guards
+Our aim is to help understand the constraints when implementing guards
 and to describe the concrete techniques used in the various layers of RPython's
 tracing JIT. All design decisions will be motivated by concrete numbers for the
 frequency and the overhead related to guards.
@@ -155,23 +156,23 @@
 interpreter. Therefore guards need enough associated information to enable
 rebuilding the interpreter state. The memory overhead of this information
 should be kept low. These constraints and trade-offs are what make the design
-and optimization of guards an important and non-trivial aspect of the low-level
-design of a tracing just-in-time compiler.
+and optimization of guards an important and non-trivial aspect of the construction
+of a tracing just-in-time compiler.
 
 %Section~\ref{sec:Evaluation} presents Figures about the absolute number of
 %operations for each benchmark, and the overhead produced by the information
 %stored at the different levels for the guards
 In this paper we want to substantiate the aforementioned observations and,
 based on them, describe the reasoning behind the implementation of guards in
-RPython's tracing just-in-time compiler. The contributions of this paper are:
+RPython's tracing just-in-time compiler. The contributions of this paper are:
 \begin{itemize}
-  \item An analysis and benchmark of guards in the context of RPython's tracing JIT,
+  \item an analysis and benchmark of guards in the context of RPython's tracing JIT,
   %An analysis of guards in the context of RPython's tracing JIT to
   %substantiate the aforementioned observation, based on a set of benchmarks,
   \item detailed measurements about the frequency and the
   overhead associated with guards, and
   \item a description about how guards are implemented in the high\-
-  and low-level components of the JIT and a description of the rationale behind the design.
+  and low-level components of the JIT and a description of the rationale behind the design.
 \end{itemize}
 
 \begin{figure}
@@ -180,11 +181,11 @@
     \label{fig:guard_percent}
 \end{figure}
 
-The set of central concepts upon which this work is based is described in
+The set of central concepts upon which this work is based is described in
 Section~\ref{sec:Background}, such as the PyPy project, the RPython language
 and its meta-tracing JIT. Based on these concepts in Section~\ref{sec:Resume
 Data} we proceed to describe for RPython's tracing JIT the details of guards in
-the frontend related to recording and storing the
+the frontend. In this context the frontend is concerned with recording and storing the
 information required to rebuild the interpreter state in case of a guard
 failure. Once the frontend has traced and optimized a loop it invokes the
 backend to compile the operations to machine code, Section~\ref{sec:Guards in
@@ -207,7 +208,7 @@
 The RPython language and the PyPy project~\cite{rigo_pypys_2006} were started
 in 2002 with the goal of
 creating a Python interpreter written in a high level language, allowing easy
-language experimentation and extension. PyPy is now a fully compatible
+language experimentation and extension.\footnote{\url{http://pypy.org}} PyPy is now a fully compatible
 alternative interpreter for the Python language.
 Using RPython's tracing JIT compiler it is on average about 5 times faster than
 CPython, the reference implementation.
@@ -215,12 +216,10 @@
 features provided by RPython
 such as the tracing just-in-time compiler described below.
 
-RPython, the language and the toolset originally developed to implement the
+RPython, the language and the toolset originally created to implement the
 Python interpreter have developed into a general environment for experimenting
-and developing fast and maintainable dynamic language implementations. There
-are, besides the Python interpreter, experimental implementations of
-Prolog~\cite{bolz_towards_2010}, Javascript, R,
-Smalltalk~\cite{bolz_towards_2010} among other that are written in RPython at
+and developing fast and maintainable dynamic language implementations. Besides
+the Python interpreter there are several experimental language implementations at
+different levels of completeness, e.g. for Prolog~\cite{bolz_towards_2010}, Smalltalk~\cite{bolz_towards_2010}, JavaScript and R.
-different levels of completeness.
 
 RPython can mean one of two things:
@@ -258,7 +259,7 @@
 path, tracing is started thus recording all operations that are executed on this
 path. This includes inlining function calls.
 As in most compilers, tracing JITs use an intermediate representation to
-store the recorded operations, which is typically in SSA
+store the recorded operations, typically in SSA
 form~\cite{cytron_efficiently_1991}. Since tracing follows actual execution the
 code that is recorded
 represents only one possible path through the control flow graph. Points of
@@ -273,9 +274,9 @@
 
 When the check of a guard fails, the execution of the machine code must be
 stopped and the control is returned to the interpreter, after the interpreter's
-state has been restored. If a particular guard fails often a new trace is
-recorded starting from the guard. We will refer to this kind of trace as a
-\emph{bridge}. Once a bridge has been traced it is attached to the
+state has been restored. If a particular guard fails often, a new trace
+starting from the guard is recorded. We will refer to this kind of trace as a
+\emph{bridge}. Once a bridge has been traced and compiled it is attached to the
 corresponding guard by patching the machine code. The next time the guard fails
 the bridge will be executed instead of leaving the machine code.
 
@@ -328,21 +329,21 @@
 This information is called the \emph{resume data}.
 
 To do this reconstruction it is necessary to take the values of the SSA
-variables of the trace and build interpreter stack frames.  Tracing
+variables in the trace to build interpreter stack frames.  Tracing
 aggressively inlines functions, therefore the reconstructed state of the
 interpreter can consist of several interpreter frames.
 
 If a guard fails often enough, a trace is started from it
-forming a trace tree.
+to create a bridge, forming a trace tree.
 When that happens another use case of resume data
-is to construct the tracer state.
+is to reconstruct the tracer state.
 After the bridge has been recorded and compiled it is attached to the guard.
 If the guard fails later the bridge is executed. Therefore the resume data of
 that guard is no longer needed.
 
 There are several forces guiding the design of resume data handling.
 Guards are a very common operation in the traces.
-However, a large percentage of all operations
+However, as will be shown, a large percentage of all operations
 are optimized away before code generation.
 Since there are a lot of guards
 the resume data needs to be stored in a very compact way.
@@ -359,14 +360,14 @@
 The stack contains only those interpreter frames seen by the tracer.
 The frames are symbolic in that the local variables in the frames
 do not contain values.
-Instead, every local variables contains the SSA variable of the trace
+Instead, every local variable contains the SSA variable of the trace
 where the value would later come from, or a constant.
 
 \subsection{Compression of Resume Data}
 \label{sub:compression}
 
 After tracing has been finished the trace is optimized.
-During optimization a large percentage of operations can be removed.
+During optimization a large percentage of operations can be removed. \todo{add a reference to the figure showing the optimization rates?}
 In the process the resume data is transformed into its final, compressed form.
 The rationale for not compressing the resume data during tracing
 is that a lot of guards will be optimized away.
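A small Python model of the symbolic frames described above may make this concrete (hypothetical names, not RPython's data structures): resume data maps local variables to SSA variables or constants, and actual values are only filled in when a guard fails.

```python
class Const(object):
    def __init__(self, value):
        self.value = value

class SymbolicFrame(object):
    # locals map variable names to SSA variable names of the trace,
    # or to Const objects -- never to concrete runtime values
    def __init__(self, locals_map):
        self.locals = locals_map

def rebuild_frames(symbolic_frames, ssa_values):
    """On guard failure: materialize interpreter frames from the
    values the SSA variables hold at the point of the guard."""
    frames = []
    for frame in symbolic_frames:
        frames.append({
            name: source.value if isinstance(source, Const)
                  else ssa_values[source]
            for name, source in frame.locals.items()
        })
    return frames

# two symbolic frames because tracing inlined a call
frames = [SymbolicFrame({'x': 'i0', 'y': Const(7)}),
          SymbolicFrame({'n': 'i2'})]
assert rebuild_frames(frames, {'i0': 41, 'i2': 3}) == [{'x': 41, 'y': 7}, {'n': 3}]
```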
@@ -391,7 +392,7 @@
 comes from.
 The remaining 14 bits are a payload that depends on the tag bits.
 
-The possible source of information are:
+The possible sources of information are:
 
 \begin{itemize}
     \item For small integer constants
@@ -411,7 +412,7 @@
 Using many classical compiler optimizations the JIT tries to remove as many
 operations, and therefore guards, as possible.
 In particular guards can be removed by subexpression elimination.
-If the same guard is encountered a second time in the trace,
+If the same guard is encountered a second time in a trace,
 the second one can be removed.
 This also works if a later guard is weaker
 and hence implied by an earlier guard.
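The duplicate-guard elimination described in this hunk amounts to one forward pass over the trace with a set of already-seen guards. A simplified sketch (hypothetical trace representation; the real optimizer also handles the weaker-guard-implied-by-stronger case, omitted here):

```python
def remove_duplicate_guards(trace):
    # trace: list of (opname, args) tuples; guard ops start with 'guard_'
    seen = set()
    optimized = []
    for op in trace:
        name, args = op
        if name.startswith('guard_'):
            if op in seen:       # same guard seen earlier: implied, drop it
                continue
            seen.add(op)
        optimized.append(op)
    return optimized

trace = [('int_lt', ('i0', 'i1')),
         ('guard_true', ('i2',)),
         ('int_add', ('i0', 'i1')),
         ('guard_true', ('i2',))]   # duplicate of the earlier guard
assert remove_duplicate_guards(trace) == trace[:3]
```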
@@ -436,7 +437,7 @@
 Consequently the resume data needs to store enough information
 to make this reconstruction possible.
 
-Adding this additional information is done as follows:
+Storing this additional information is done as follows:
 So far, every variable in the symbolic frames
 contains a constant or an SSA variable.
 After allocation removal the variables in the symbolic frames can also contain
@@ -455,8 +456,8 @@
 During the storing of resume data virtual objects are also shared
 between subsequent guards as much as possible.
 The same observation as about frames applies:
-Quite often a virtual object does not change from one guard to the next.
-Then the data structure is shared.
+Quite often a virtual object does not change from one guard to the next,
+allowing the data structure to be shared.
 
 A related optimization is the handling of heap stores by the optimizer.
 The optimizer tries to delay stores into the heap as long as possible.
@@ -499,7 +500,7 @@
 \end{figure}
 
 
-After optimization the resulting trace is handed over to the platform specific
+After the recorded trace has been optimized, it is handed over to the platform-specific
 backend to be compiled to machine code. The compilation phase consists of two
 passes over the lists of instructions, a backwards pass to calculate live
 ranges of IR-level variables and a forward pass to emit the instructions. During
@@ -512,9 +513,9 @@
 emitted. Guard instructions are transformed into fast checks at the machine
 code level that verify the corresponding condition. In cases where the value being
 checked by the guard is not used anywhere else the guard and the operation
-producing the value can often be merged, further reducing the overhead of the guard.
-Figure~\ref{fig:trace-compiled} shows how the \texttt{int\_eq} operation
-followed by a \texttt{guard\_false} from the trace in Figure~\ref{fig:trace-log} are compiled to
+producing the value can be merged, further reducing the overhead of the guard.
+Figure~\ref{fig:trace-compiled} shows how the \lstinline{int_eq} operation
+followed by a \lstinline{guard_false} from the trace in Figure~\ref{fig:trace-log} are compiled to
 pseudo-assembler if the operation and the guard are compiled separately or if
 they are merged.
 
@@ -558,11 +559,11 @@
 
 First a special data
 structure called \emph{backend map} is created. This data structure encodes the
-mapping from the IR-variables needed by the guard to rebuild the state to the
+mapping from IR-variables needed by the guard to rebuild the state to the
 low-level locations (registers and stack) where the corresponding values will
 be stored when the guard is executed.
 This data
-structure stores the values in a succinct manner using an encoding that uses
+structure stores the values in a succinct manner using an encoding that requires
 8 bits to store 7 bits of information, ignoring leading zeros. This encoding is efficient to create and
 provides a compact representation of the needed information in order
 to maintain an acceptable memory profile.
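The paper does not spell out the backend map's byte format beyond "8 bits to store 7 bits of information, ignoring leading zeros"; a common scheme matching that description is a variable-length integer encoding with a continuation bit, sketched here as an assumption rather than as RPython's actual format:

```python
def encode_uint(n):
    # 7 payload bits per byte; the high bit marks "more bytes follow".
    # byte groups that would be all leading zeros are simply never emitted.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uint(data):
    n = 0
    shift = 0
    for byte in bytearray(data):
        n |= (byte & 0x7F) << shift
        shift += 7
    return n

assert len(encode_uint(100)) == 1      # small values stay small
assert len(encode_uint(300)) == 2
for value in (0, 1, 127, 128, 300, 2 ** 20):
    assert decode_uint(encode_uint(value)) == value
```

Such a format is cheap to produce during a single pass over the live-range information and keeps the per-guard memory cost proportional to the magnitude of the stored values.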
@@ -574,18 +575,18 @@
 backend map is loaded and after storing the current execution state
 (registers and stack) execution jumps to a generic bailout handler, also known
 as \emph{compensation code},
-that is used to leave the compiled trace in case of a guard failure.
+that is used to leave the compiled trace.
 
 Using the encoded location information the bailout handler reads from the
-saved execution state the values that the IR-variables had  at the time of the
+stored execution state the values that the IR-variables had at the time of the
 guard failure and stores them in a location that can be read by the frontend.
-After saving the information the control is passed to the frontend signaling
-which guard failed so the frontend can read the information passed and restore
+After saving the information the control is returned to the frontend signaling
+which guard failed so the frontend can read the stored information and rebuild
 the state corresponding to the point in the program.
 
-As in previous sections the underlying idea for the design of guards is to have
-a fast on-trace profile and a potentially slow one in the bailout case where
-the execution has to return to the interpreter due to a guard failure. At the same
+As in previous sections, the underlying idea for the low-level design of guards is to have
+a fast on-trace profile and a potentially slow one in case
+the execution has to return to the interpreter. At the same
 time the data stored in the backend, required to rebuild the state, should be as
 compact as possible to reduce the memory overhead produced by the large number
 of guards; the numbers in Figure~\ref{fig:backend_data} illustrate that the
@@ -604,9 +605,9 @@
 main difference is the setup phase. When compiling a trace we start with a clean
 slate. The compilation of a bridge is started from a state (register and stack
 bindings) that corresponds to the state during the compilation of the original
-guard. To restore the state needed to compile the bridge we use the encoded
-representation created for the guard to rebuild the bindings from IR-variables
-to stack locations and registers used in the register allocator.  With this
+guard. To restore the state needed to compile the bridge we use the backend map
+created for the guard to rebuild the bindings from IR-variables
+to stack locations and registers.  With this
 reconstruction all bindings are restored to the state as they were in the
 original loop up to the guard. This means that no register/stack reshuffling is
 needed before executing a bridge.
@@ -643,8 +644,8 @@
 micro-benchmarks and larger programs.\footnote{\url{http://speed.pypy.org/}} The
 benchmarks were taken from the PyPy benchmarks repository using revision
 \texttt{ff7b35837d0f}.\footnote{\url{https://bitbucket.org/pypy/benchmarks/src/ff7b35837d0f}}
-The benchmarks were run on a version of PyPy based on the
-revision~\texttt{0b77afaafdd0} and patched to collect additional data about the
+The benchmarks were run on a version of PyPy based on
+revision~\texttt{0b77afaafdd0} and patched to collect additional data about
 guards in the machine code
 backends.\footnote{\url{https://bitbucket.org/pypy/pypy/src/0b77afaafdd0}} The
 tools used to run and evaluate the benchmarks including the patches applied to
@@ -690,11 +691,11 @@
   \item Guard failures are local and rare.
 \end{itemize}
 
-All measurements presented in this section do not take garbage collection of machine code into account. Pieces
+The measurements presented in this section do not take garbage collection of resume data and machine code into account. Pieces
 of machine code can be globally invalidated or just become cold again. In both
 cases the generated machine code and the related data is garbage collected. The
 figures show the total amount of operations that are evaluated by the JIT and
-the total amount of code and data that is generated from the optimized traces.
+the total amount of code and resume data that is generated.
 
 
 \subsection{Frequency of Guards}
@@ -708,10 +709,10 @@
 Figure~\ref{fig:benchmarks} extends Figure~\ref{fig:guard_percent} and summarizes the total number of operations that were
 recorded during tracing for each of the benchmarks and what percentage of these
 operations are guards. The number of operations was counted on the unoptimized
-and optimized traces. The Figure shows that the overall optimization rate for
+and optimized traces. The figure also shows the overall optimization rate for
 operations, which is between 69.4\% and 83.89\% of the traced operations, and the
 optimization rate of guards, which is between 65.8\% and 86.2\% of the
-operations, are very similar. This indicates that the optimizer can remove
+operations. This indicates that the optimizer can remove
 most of the guards, but after the optimization pass these still account for
 15.2\% to 20.2\% of the operations being compiled and later executed.
 The frequency of guard operations makes it important to store the associated
@@ -783,8 +784,7 @@
 \label{sub:guard_failure}
 The last point in this discussion is the frequency of guard failures.
 Figure~\ref{fig:failing_guards} presents for each benchmark a list of the
-relative amounts of guards that ever fail and of guards that fail often enough that a bridge is compiled.
-\footnote{
+relative amounts of guards that ever fail and of guards that fail often enough that a bridge is compiled.\footnote{
     The threshold used is 200 failures. This rather high threshold was picked experimentally to give
     good results for long-running programs.
 }
@@ -800,7 +800,7 @@
 \end{figure}
 
 From Figure~\ref{fig:failing_guards} we can see that only a very small amount
-of all the guards in the optimized traces ever fail. This amount varies between
+of all the guards in the compiled traces ever fail. This amount varies between
 2.4\% and 5.7\% of all guards. As can be expected, even fewer guards fail often
 enough that a bridge is compiled for them: only 1.2\% to 3.6\% of all guards
 fail often enough that a bridge is compiled. Also, of all failing guards a few fail extremely often
@@ -827,7 +827,7 @@
 compilers to represent possible divergent control flow paths.
 
 SPUR~\cite{bebenita_spur:_2010} is a tracing JIT compiler
-for a C\# virtual machine.
+for a CIL virtual machine.
 It handles guards by always generating code for every one of them
 that transfers control back to the unoptimized code.
 Since the transfer code needs to reconstruct the stack frames
@@ -845,20 +845,20 @@
 of snapshots for every guard to reduce memory pressure. Snapshots are only
 created for guards after updates to the global state, after control flow points
 from the original program and for guards that are likely to fail. As an outlook
-Pall mentions the plans to switch to compressed snapshots to further reduce
+Pall mentions plans to switch to compressed snapshots to further reduce
 redundancy. The approach of not creating snapshots at all for every guard is
 orthogonal to the resume data compression presented in this paper and could be
 reused within RPython to improve the memory usage further.
 
 Linking side exits to pieces of later compiled machine code was described first
-in the context of Dynamo~\cite{Bala:2000wv} under the name of Fragment Linking.
-Once a new hot trace is emitted into the fragment cache it is linked to side
-exit that led to the compilation of the fragment. Fragment Linking avoids the
+in the context of Dynamo~\cite{Bala:2000wv} under the name of fragment linking.
+Once a new hot trace is emitted into the fragment cache it is linked to the side
+exit that led to the compilation of the fragment. Fragment linking avoids the
 performance penalty involved in leaving the compiled code. Fragment linking
 also allows removing compensation code associated with the linked fragments that
 would have been required to restore the execution state on the side exit.
 
-Gal et. al~\cite{Gal:2006} describe how in the HotpathVM they experimented
+Gal et al.~\cite{Gal:2006} describe how in HotpathVM, a JIT for a Java VM, they experimented
 with having one generic compensation code block, like the RPython JIT, that
 uses a register variable mapping to restore the interpreter state. Later this
 was replaced by generating compensation code for each guard which produced a
@@ -933,16 +933,16 @@
 flow divergence in recorded traces.
 Based on the observation that guards are a frequent operation in traces and
 that they do not fail often, we described how they have been implemented in the
-high and low level components of RPython's tracing JIT compiler.
+high- and low-level components of RPython's tracing JIT compiler.
 
 Additionally we have presented experimental data collected using the standard PyPy
-benchmark set to evaluate previous observations and assumptions. Our
+benchmark set to evaluate previous observations and assumptions about guards. Our
 experiments confirmed that guards are a very common
 operation in traces. At the same time guards are associated with a high
 overhead, because for all compiled guards information needs to be
 stored to restore the execution state in case of a bailout. The measurements
 showed that the compression techniques used in PyPy effectively reduce the
-overhead of guards, but it still produces a significant overhead. The results
+overhead of guards, but they still produce a significant overhead. The results
 also showed that guard failure is a local event: there are few
 guards that fail at all, and even fewer that fail very often.
 These numbers validate the design decision of reducing the overhead of
@@ -961,7 +961,7 @@
 failure.
 
 \section*{Acknowledgements}
-We would like to thank David Edelsohn and Stephan Zalewski for their helpful
+We would like to thank David Edelsohn, Samuele Pedroni and Stephan Zalewski for their helpful
 feedback and valuable comments while writing this paper.
 
 %\section*{Appendix}


More information about the pypy-commit mailing list