# [pypy-svn] r63577 - pypy/extradoc/talk/icooolps2009

cfbolz at codespeak.net
Fri Apr 3 19:07:05 CEST 2009

Author: cfbolz
Date: Fri Apr  3 19:07:04 2009
New Revision: 63577

Modified:
   pypy/extradoc/talk/icooolps2009/paper.bib
   pypy/extradoc/talk/icooolps2009/paper.tex
Log:
Some more references. Finally take a stab at the benchmark section.

==============================================================================
+++ pypy/extradoc/talk/icooolps2009/paper.bib	Fri Apr  3 19:07:04 2009
@@ -1,4 +1,13 @@
﻿
+@phdthesis{carl_friedrich_bolz_automatic_2008,
+	type = {Master's thesis},
+	title = {Automatic {JIT} Compiler Generation with Runtime Partial Evaluation},
+	school = {{Heinrich-Heine-Universität} Düsseldorf},
+	author = {Carl Friedrich Bolz},
+	year = {2008}
+},
+
@inproceedings{ancona_rpython:step_2007,
title = {{RPython:} a step towards reconciling dynamically and statically typed {OO} languages},
@@ -57,6 +66,12 @@
year = {1999},
},

+@inproceedings{andreas_gal_trace-based_2009,
+	title = {Trace-based {Just-in-Time} Type Specialization for Dynamic Languages},
+	author = {Andreas Gal and Brendan Eich and Mike Shaver and David Anderson and Blake Kaplan and Graydon Hoare and David Mandelin and Boris Zbarsky and Jason Orendorff and Michael Bebenita and Mason Chang and Michael Franz and Edwin Smith and Rick Reitmaier and Mohammad Haghighat},
+	year = {2009}
+},
+
@techreport{mason_chang_efficient_2007,
title = {Efficient {Just-In-Time} Execution of Dynamically Typed Languages
Via Code Specialization Using Precise Runtime Type Inference},
@@ -143,8 +158,8 @@
title = {A uniform approach for compile-time and run-time specialization},
url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.248},
doi = {10.1.1.103.248},
-	journal = {{PARTIAL} {EVALUATION,} {INTERNATIONAL} {SEMINAR,} {DAGSTUHL} {CASTLE,} {NUMBER} 1110 {IN} {LECTURE} {NOTES} {IN} {COMPUTER} {SCIENCE}},
-	author = {Charles Consel and Luke Hornof and Francois Noel and Jacques Noye and Nicolae Volanschi and Universite De Rennes Irisa},
+	journal = {Dagstuhl Seminar on Partial Evaluation},
+	author = {Charles Consel and Luke Hornof and François Noël and Jacques Noyé and Nicolae Volanschi},
year = {1996},
pages = {54--72}
},
@@ -234,18 +249,23 @@
url = {http://portal.acm.org/citation.cfm?id=237767},
doi = {10.1145/237721.237767},
abstract = {Note: {OCR} errors may be found in this Reference List extracted from the full text article. {ACM} has opted to expose the complete List rather than only correct and linked references.},
-	booktitle = {Proceedings of the 23rd {ACM} {SIGPLAN-SIGACT} symposium on Principles of programming languages},
+	booktitle = {Proceedings of the 23rd {ACM} {SIGPLAN-SIGACT} symposium on Principles of Programming Languages},
publisher = {{ACM}},
author = {Charles Consel and François Noël},
year = {1996},
pages = {145--156}
},

-@phdthesis{carl_friedrich_bolz_automatic_2008,
-	type = {Master Thesis},
-	title = {Automatic {JIT} Compiler Generation with Runtime Partial Evaluation
-},
-	school = {{Heinrich-Heine-Universität} Düsseldorf},
-	author = {Carl Friedrich Bolz},
-	year = {2008}
+@inproceedings{chang_tracing_2009,
+	address = {Washington, {DC,} {USA}},
+	title = {Tracing for web 3.0: trace compilation for the next generation web applications},
+	isbn = {978-1-60558-375-4},
+	url = {http://portal.acm.org/citation.cfm?id=1508293.1508304},
+	doi = {10.1145/1508293.1508304},
+	abstract = {Today's web applications are pushing the limits of modern web browsers. The emergence of the browser as the platform of choice for rich client-side applications has shifted the use of in-browser {JavaScript} from small scripting programs to large computationally intensive application logic. For many web applications, {JavaScript} performance has become one of the bottlenecks preventing the development of even more interactive client side applications. While traditional just-in-time compilation is successful for statically typed virtual machine based languages like Java, compiling {JavaScript} turns out to be a challenging task. Many {JavaScript} programs and scripts are short-lived, and users expect a responsive browser during page loading. This leaves little time for compilation of {JavaScript} to generate machine code.},
+	booktitle = {Proceedings of the 2009 {ACM} {SIGPLAN/SIGOPS} international conference on Virtual execution environments},
+	publisher = {{ACM}},
+	author = {Mason Chang and Edwin Smith and Rick Reitmaier and Michael Bebenita and Andreas Gal and Christian Wimmer and Brendan Eich and Michael Franz},
+	year = {2009},
+	pages = {71--80}
}

==============================================================================
+++ pypy/extradoc/talk/icooolps2009/paper.tex	Fri Apr  3 19:07:04 2009
@@ -93,7 +93,8 @@
dynamic features of a language.

A recent approach to getting better performance for dynamic languages is that of
-tracing JIT compilers \cite{XXX}. Writing a tracing JIT compiler is relatively
+tracing JIT compilers \cite{gal_hotpathvm:effective_2006,
+mason_chang_efficient_2007}. Writing a tracing JIT compiler is relatively
simple. It can be added to an existing interpreter for a language: the
interpreter takes over some of the functionality of the compiler, and the
machine code generation part can be simplified.
@@ -208,8 +209,9 @@
VMs \cite{gal_hotpathvm:effective_2006}. It also turned out that they are a
relatively simple way to implement a JIT compiler for a dynamic language
\cite{mason_chang_efficient_2007}. The technique is now
-being used by both Mozilla's TraceMonkey JavaScript VM \cite{XXX} and Adobe's
-Tamarin ActionScript VM \cite{XXX}.
+being used by Mozilla's TraceMonkey JavaScript VM
+\cite{andreas_gal_trace-based_2009} and has been tried for Adobe's Tamarin
+ActionScript VM \cite{chang_tracing_2009}.

Tracing JITs are built on the following basic assumptions:

@@ -698,8 +700,65 @@

In this section we try to evaluate the work done so far by looking at some
benchmark numbers. Since the work is not finished, these benchmarks can only be
-preliminary. All benchmarking was done on a machine with a 1.4 GHz Pentium M
-processor and 1GiB RAM, using Linux 2.6.27.
+preliminary. All benchmarking was done on an otherwise idle machine with a 1.4
+GHz Pentium M processor and 1GiB RAM, using Linux 2.6.27.
+
+The first round of benchmarks (Figure \ref{fig:bench1}) consists of timings of
+the example interpreter (Figure \ref{fig:tlr-basic}) used in this paper
+computing the square of 46340 (the largest number whose square still fits into
+a signed 32-bit word) using the bytecode of Figure \ref{fig:square}. The
+results for the various configurations are as follows:
+
+\begin{enumerate}
+\item The interpreter translated to C without any JIT inserted at all.
+\item The tracing JIT is enabled, but no interpreter-specific
+hints are applied. This corresponds to the trace in Figure
+\ref{fig:trace-normal}. The time includes tracing and the production of the
+machine code, as well as running the fallback interpreter to leave the machine
+code. The threshold for considering a loop hot is 40 iterations.
+\item The hints of Figure \ref{fig:tlr-full} are applied, which means the loop
+of the square function is reflected in the trace. Constant folding of green
+variables is disabled, though. This corresponds to the trace in Figure
+\ref{fig:trace-no-green-folding}. XXX
+\item Same as before, but with constant folding enabled. This corresponds to the
+trace in Figure \ref{fig:trace-full}. This speeds up the square function nicely,
+making it about six times faster than the pure interpreter.
+\item Same as before, but with the threshold set so high that the tracer is
+never invoked. This measures the overhead of profiling. For this interpreter
+the overhead seems rather large, with a 50\% slowdown due to profiling. This is
+because the example interpreter needs to do one hash table lookup per loop
+iteration. For larger interpreters (e.g. the Python one) the overhead is likely
+to be less significant, given that many operations in Python need hash table
+lookups themselves.
+\item The whole computation is run on the tracing interpreter, to estimate the
+overhead involved in tracing. The trace is not actually recorded (which would
+be a memory problem), so in reality the number is even higher. Due to the
+double interpretation the overhead is huge. It remains to be seen whether that
+will be a problem for practical interpreters.
+\item For comparison, the time of running the interpreter on top of CPython
+(version 2.5.2).
+\end{enumerate}
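For readers without the figures at hand, a rough sketch of such a register-based example interpreter and a square-computing bytecode might look as follows (plain Python rather than RPython; the opcode names, encoding, and register assignments are invented for illustration, not taken from the paper):

```python
# Hypothetical sketch of a tiny register-based bytecode interpreter.
# SQUARE computes a * a by adding a copy of 'a' to an accumulator 'a' times.
MOV_A_R, MOV_R_A, ADD_R_TO_A, DECR_A, RETURN_A, JUMP_IF_A = range(6)

SQUARE = [
    MOV_A_R, 0,       # i = a
    MOV_A_R, 1,       # keep a copy of a in register 1
    # main loop, starting at position 4:
    MOV_R_A, 0,       # i -= 1
    DECR_A,
    MOV_A_R, 0,
    MOV_R_A, 2,       # res += a
    ADD_R_TO_A, 1,
    MOV_A_R, 2,
    MOV_R_A, 0,       # if i != 0: goto 4
    JUMP_IF_A, 4,
    MOV_R_A, 2,       # return res
    RETURN_A,
]

def interp(bytecode, a):
    regs = [0, 0, 0]          # three registers; 'a' is the accumulator
    pc = 0
    while True:
        opcode = bytecode[pc]
        pc += 1
        if opcode == MOV_A_R:
            regs[bytecode[pc]] = a
            pc += 1
        elif opcode == MOV_R_A:
            a = regs[bytecode[pc]]
            pc += 1
        elif opcode == ADD_R_TO_A:
            a += regs[bytecode[pc]]
            pc += 1
        elif opcode == DECR_A:
            a -= 1
        elif opcode == RETURN_A:
            return a
        elif opcode == JUMP_IF_A:
            target = bytecode[pc]
            pc += 1
            if a:
                pc = target

print(interp(SQUARE, 46340))  # 2147395600
```

The backward jump at `JUMP_IF_A` is the point where a tracing JIT would profile and, once the loop is hot, start recording a trace.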
+
+\begin{figure}
+\noindent
+\begin{tabular}{|l|r|}
+\hline
+ &ratio\tabularnewline
+\hline
+Interpreter compiled to C, no JIT &1\tabularnewline \hline
+Normal Trace Compilation &1.20\tabularnewline \hline
+Unfolding of Language Interpreter Loop &XXX\tabularnewline \hline
+Full Optimizations &0.17\tabularnewline \hline
+Profiling Overhead &1.51\tabularnewline \hline
+Interpreter run by Tracing Interpreter &860.20\tabularnewline \hline
+Interpreter run by CPython &256.17\tabularnewline \hline
+\end{tabular}
+\caption{Benchmark results of the example interpreter computing the square of
+46340}
+\label{fig:bench1}
+\end{figure}
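The per-iteration profiling measured in item 5 can be sketched as follows (an illustrative reconstruction, not PyPy's actual code; the `LoopProfiler` class and `backward_jump` method are invented names): every backward jump performs one hash table lookup to bump a per-loop counter, and reaching the threshold of 40 iterations marks the loop as hot.

```python
HOT_THRESHOLD = 40  # hotness threshold used in the benchmarks above

class LoopProfiler:
    def __init__(self):
        self.counters = {}  # jump target -> number of iterations seen

    def backward_jump(self, target):
        # One hash table lookup per loop iteration -- for the tiny example
        # interpreter this dominates the per-iteration cost, hence the
        # measured 50% profiling overhead.
        n = self.counters.get(target, 0) + 1
        self.counters[target] = n
        return n >= HOT_THRESHOLD  # True: loop is hot, start tracing
```

In a larger interpreter each bytecode already does comparable dictionary work (e.g. Python attribute lookups), so the relative cost of this one extra lookup shrinks.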
+
+

%- benchmarks
%    - running example