[pypy-dev] Interpreter for vectorized language

Paolo Giarrusso p.giarrusso at gmail.com
Sat Dec 18 10:48:00 CET 2010


Hi Armin, hi all,
below are my 2 cents, with some background on advanced optimizations
like cross-loop optimization and automatic vectorization, which
others have mentioned without references. These optimizations are
used for (and are mostly specific to) numerical code like the one
discussed here, in compiled languages like C/C++/Fortran.

On Fri, Dec 17, 2010 at 13:44, Armin Rigo <arigo at tunes.org> wrote:
> Hi Alex,
>
> On Thu, Dec 16, 2010 at 8:28 PM, Alex Gaynor <alex.gaynor at gmail.com> wrote:
>>> Regarding this - I was thinking about haveing a + b - c create a
>>> bytecode that would be executed using small interpreter with a
>>> jit-merge-point and a hint "can be vectorized".
>>
>> That seems like a pretty big special case, why not work at the larger idea
>> of cross-loop optimizations?
>
> We have cross-loop optimizations.  Unless I'm misunderstanding you, we
> have already done it.  That's how for example map(f, somelist) gets
> JITted: it is a small loop that contains a call to the "normal" loop
> of the interpreter for f().  This is JITted as the loop from map()
> containing inline code from the interpreter loop.

Cross-loop optimizations are useful for optimizing C/Fortran
numerical code. For code like a + b - c, you want the compiler to
optimize different loops of the application together (the loop
implementing a + b and the one subtracting c from the result).
For instance, if a, b and c are vectors, a + b - c is equivalent
(after inlining) to the following code, which should be transformed
into a single loop, to reduce loop overhead but, more importantly, to
perform only one pass over d and thus improve cache locality for huge
vectors that do not fit entirely in the CPU caches:

# a, b, c, d are vectors of the same length
for i in range(length):
  d[i] = a[i] + b[i]
for i in range(length):
  d[i] = d[i] - c[i]

optimized result:

for i in range(length):
  d[i] = a[i] + b[i] - c[i]

> I am a bit unsure however what is being discussed here, because
> fijal's comment is very terse and contains hidden issues, like when to
> create the bytecode (at runtime? but how can it be a green constant
> then?

I think he wants to do something similar to the regexp compilation +
JITting described on PyPy's blog: one can compile a regexp to
bytecode at runtime, but afterwards the result is a constant for the
regexp interpreter.
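
To make that concrete, here is a rough Python sketch of such a
compile-once, interpret-many scheme for a + b - c. The opcode names
and the little stack machine are invented for illustration, not PyPy
code; the point is that the bytecode is built once at runtime and is
then a constant for the small interpreter loop, which is roughly
where the proposed jit-merge-point and "can be vectorized" hint
would go.

# Bytecode built once at runtime for "a + b - c"; afterwards it is a
# constant for the small interpreter below.
BYTECODE = ('load_a', 'load_b', 'add', 'load_c', 'sub')

def interpret(bytecode, a, b, c):
    result = []
    for i in range(len(a)):
        stack = []
        for op in bytecode:   # a jit-merge-point would sit around here
            if op == 'load_a':
                stack.append(a[i])
            elif op == 'load_b':
                stack.append(b[i])
            elif op == 'load_c':
                stack.append(c[i])
            elif op == 'add':
                y = stack.pop(); x = stack.pop()
                stack.append(x + y)
            elif op == 'sub':
                y = stack.pop(); x = stack.pop()
                stack.append(x - y)
        result.append(stack[-1])
    return result

For example, interpret(BYTECODE, [1, 2], [4, 5], [0.5, 0.5]) returns
[4.5, 6.5], computing (a[i] + b[i]) - c[i] in a single pass.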

In this case, I read the statement as meaning that fijal wants to
compile not a regexp but the equivalent of a C++ expression template
(see my other email): for a + b - c, you'd get Minus(Plus(a, b), c),
and hopefully compile it directly to the efficient code above.
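
For illustration, here is a minimal Python sketch of that idea. The
class names (Vector, Plus, Minus) are made up here and are not an
existing PyPy API: the operators build an expression tree instead of
computing immediately, and evaluating the tree walks all operands in
a single loop, with no temporary vectors.

class Expr(object):
    def __add__(self, other):
        return Plus(self, other)
    def __sub__(self, other):
        return Minus(self, other)

class Vector(Expr):
    def __init__(self, data):
        self.data = data
    def at(self, i):
        return self.data[i]
    def __len__(self):
        return len(self.data)

class Plus(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def at(self, i):
        return self.left.at(i) + self.right.at(i)
    def __len__(self):
        return len(self.left)

class Minus(Plus):
    def at(self, i):
        return self.left.at(i) - self.right.at(i)

def evaluate(expr):
    # one pass over the operands, no intermediate vectors
    return Vector([expr.at(i) for i in range(len(expr))])

a = Vector([1.0, 2.0, 3.0])
b = Vector([4.0, 5.0, 6.0])
c = Vector([0.5, 0.5, 0.5])
d = evaluate(a + b - c)   # a + b - c builds Minus(Plus(a, b), c);
                          # d.data is [4.5, 6.5, 8.5]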

> And what is the hint "can be vectorized"?).

I can explain what "vectorized" means (it is a specific loop
optimization), but I'm not sure why you would want the annotation. A
reasonable possibility is to run this optimization pass only on code
carrying such an annotation, because most code doesn't need it.

Most likely, in this context he refers to (automatic) loop
vectorization, i.e., automatically converting code to use SIMD
instructions. On current processors, the following vector code can be
implemented with a single SIMD add instruction, but the compiler has
to figure that out:

for i in range(4):
  c[i] = a[i] + b[i]

In general, a SIMD instruction operates in parallel on two short
vector operands, rather than on two scalar operands. The Intel
C/Fortran compilers, and more recently also GCC, can perform this
rewriting without any annotations from the user.
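
To show what the transformation looks like, here is a conceptual
Python sketch (Python itself does not emit SIMD instructions; the
sketch only shows the strip-mined loop shape a vectorizing compiler
produces, assuming a vector width of 4 floats, as with SSE):

# Scalar loop: one add per element.
def add_scalar(a, b, c, n):
    for i in range(n):
        c[i] = a[i] + b[i]

# Vectorized shape: the loop is strip-mined so that each iteration
# handles 4 elements; a compiler maps each group of 4 scalar adds
# below to a single SIMD add instruction.
def add_vectorized(a, b, c, n):
    i = 0
    while i + 4 <= n:
        c[i]     = a[i]     + b[i]
        c[i + 1] = a[i + 1] + b[i + 1]
        c[i + 2] = a[i + 2] + b[i + 2]
        c[i + 3] = a[i + 3] + b[i + 3]
        i += 4
    while i < n:              # scalar tail for leftover elements
        c[i] = a[i] + b[i]
        i += 1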

See for instance:
http://en.wikipedia.org/wiki/Vectorization_(parallel_computing)
Introduction of:
http://software.intel.com/en-us/articles/vectorization-with-the-intel-compilers-part-i/
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


