# [pypy-dev] Interpretor for vectorized langugage

Leon Sit wing1127aishi at gmail.com
Sat Dec 18 17:14:08 CET 2010

```My original intent is very much describe by Paolo. I want to use the idea of
template expression in c++ to apply to linear algebra code. The whole idea
of template expression is that all the overloaded operation does not trigger
immediate evaluation of the expression, but to return an expression instead
to next operator call. After chaining the entire formula, then the loop
unrolling starts to happen when caller invokes the operator[].

Sample

def fun(a,b,c,d): //say they are all matrix, and the formula might not make
sense but for sake of argument only
return a*a+b*c*b.T/ norm(d.T*d)

into low level C++

expr& fun(const expr& a, const expr& b, const expr&c, const expr& d){
return sum(prod(a*a), prod(b, prod(c, transpose(b))))/
norm(prod(transpose(d), d));
}

note that the type is "expression" in both parameters and input. The reason
is twofold
1) no eager evaluation on the subexpression, only evaluate the expression
when the entire expression is known
2) support lazy evaluation on the expression like

fun(x+y,y*y+y,x**y,y)[a,b]

where only the necessary computation for the entry [a,b] are calculated. For
better demo on this idea, say I have

def sum3(a,b,c):
return a+b+c

then sum3(x+y, y^2, 2*y)[1,10] should only trigger the evaluation of

return (x[1,10] + y[1,10]) + y[1,10]^2 + 2 * y[1, 10]

which itself can be optimised further and nothing else should be evaluated.
boost::ublas is rather a slow implementation of template expression idea and
the Eigen library has way more mature and efficient implementation. Without
this technique, numerical C library was never as fast as hand coded C and
Fortran code because hidden temporaries were everywhere in the formula. Only
after this template expressin technique, C++ matrix library are on par with
fortran and C. When the expression is long enough, C++ template expression
library gives much better than Lapack/atlas code because the compiler will
optimise the full expression.

I hand port an optimisation library from homebrew matrix library to ublas
and get 20x speed up, then moving to Eigen library get another 8x speed from
ublas (without using idiomatic Eigen style, only switch the type). Main
reason why Eigen is so much faster is that when unrolling the loop in Eigen,
SSE code are explicitly generated instead of depending compiler's auto
vectorisation which is not mature and only happen to simple code, openMP are
applied to matrix multiplication at the most outer loop, and data are
aligned in memory for less cache misses.

On Sat, Dec 18, 2010 at 3:48 AM, Paolo Giarrusso <p.giarrusso at gmail.com>wrote:

> Hi Armin, hi all,
> below I give my 2 cents, giving some background on advanced
> optimizations like cross-loop optimization and automatic
> vectorization, which the others are mentioning without reference.
> These optimizations are used for numerical code like the one discussed
> (and mostly specific to that) in compiled languages like
> C/C++/Fortran.
>
> On Fri, Dec 17, 2010 at 13:44, Armin Rigo <arigo at tunes.org> wrote:
> > Hi Alex,
> >
> > On Thu, Dec 16, 2010 at 8:28 PM, Alex Gaynor <alex.gaynor at gmail.com>
> wrote:
> >>> Regarding this - I was thinking about haveing a + b - c create a
> >>> bytecode that would be executed using small interpreter with a
> >>> jit-merge-point and a hint "can be vectorized".
> >>
> >> That seems like a pretty big special case, why not work at the larger
> idea
> >> of cross-loop optimizations?
> >
> > We have cross-loop optimizations.  Unless I'm misunderstanding you, we
> > have already done it.  That's how for example map(f, somelist) gets
> > JITted: it is a small loop that contains a call to the "normal" loop
> > of the interpreter for f().  This is JITted as the loop from map()
> > containing inline code from the interpreter loop.
>
> Cross-loop optimization are useful to optimize C/Fortran numerical
> code. For code like a + b - c, you want your compiler to optimize
> together different loops from the application (the loop implementing
> a+b and the one subtracting c from the result).
> For instance, if a, b, c are vectors, a + b - c is equivalent (after
> inlining) to the following code, which should be transformed into a
> single loop, to reduce loop overhead but more importantly to perform
> only one pass on d and thus improve cache locality for huge vectors,
> not entirely fitting in CPU caches:
>
> // a, b, c, d are vectors
> for i in range(1, length):
>  d[i] = a[i] + b[i]
> for i in range(1, length):
>  d[i] = d[i] - c[i]
>
> optimized result:
>
> for i in range(1, length):
>  d[i] = a[i] + b[i] + c[i]
>
> > I am a bit unsure however what is being discussed here, because
> > fijal's comment is very terse and contains hidden issues, like when to
> > create the bytecode (at runtime? but how can it be a green constant
> > then?
>
> I think he wants to do something similar to regexp compilation +
> JITting as described on PyPy's blog.
> One can compile a regexp to bytecode (at runtime), but afterwards the
> result is constant for the regexp interpreter.
>
> In this case, I read the statement as fijal wants to compile not a
> regexp but the equivalent of a C++ expression template (see my other
> email): for a + b - c, you'd get Minus(Plus(a, b), c), and hopefully
> compile it directly to the efficient code above.
>
> > And what is the hint "can be vectorized"?).
>
> I can explain what "vectorized" means (it is a specific loop
> optimization), but I'm not sure about why you want the annotation. A
> reasonable possibility is to run this optimization pass only on the
> code with such an annotation, because most code doesn't need it.
>
> Most likely in this context, he refers to (automatic) loop
> vectorization, i.e., automatically converting the code to use SIMD
> instructions. On current processor, the following matrix code can be
> implemented in one SIMD sum instruction, but the compiler has to
> figure it out:
>
> for i in range(1, 4):
>  c[i] = a[i] + b[i]
>
> In general, a SIMD operation operates in parallel on two short
> operands of vectors, rather than on two single operands. The Intel
> C/Fortran compilers, and more recently also GCC, can perform this
> rewriting without annotations from the user.
>
> See for instance:
> http://en.wikipedia.org/wiki/Vectorization_(parallel_computing)
> Introduction of:
>
> http://software.intel.com/en-us/articles/vectorization-with-the-intel-compilers-part-i/
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20101218/e4a4e31b/attachment.html>
```