[pypy-dev] Talk in the Supercomputing Day, Madrid

Paolo Giarrusso p.giarrusso at gmail.com
Mon Jan 12 12:50:35 CET 2009


Hi. First, I should qualify myself: I'm a student of virtual machine
implementations, not (yet) working on PyPy itself, and aware of HPC
issues only at a basic level. Still, I'd like to help sharpen the
discussion.

On Mon, Jan 12, 2009 at 12:10, Guillem Borrell i Nogueras
<guillem at torroja.dmt.upm.es> wrote:
> Hi again

> Let's discuss the details.

> I'll try to explain why I thought about PyPy when planning the conference
> sessions.

> My work as a Computational Fluid Dynamics researcher is intimately related to
> supercomputing, for obvious reasons. Most of the applications we work on are
> fine-tuned codes with pieces that are more than twenty years old. They are
> rock solid, implemented in Fortran, and run for hours on clusters of
> thousands of computing nodes.

> Over the last couple of years, computer architectures have become more and
> more complex. I'm playing with the Cell processor lately and that little
> bastard is causing me real pain. While programming gets easier every day,
> supercomputing gets harder and harder. Think about an architecture like
> Roadrunner: AMD Opterons alongside Cell chips (a PowerPC PPU with its
> SPUs)... Two instruction sets in one chip!

> Talking with Stanley Ahalt (Ohio Supercomputing Center) about a year ago, he
> called that the "software gap": in computing, as time goes on, low
> performance gets easier but high performance gets harder, and the gap keeps
> widening. Platform SDKs are helpful, but they are not a huge leap.

> I've always thought that virtual machines could help supercomputing the way
> they have helped grid and cloud computing. This is the point where I need
> someone to tell me whether I am right or wrong. PyPy is the most versatile,
> albeit complex, dynamic language implementation. I've been following the
> project for the last year and a half or so and I am impressed. I've thought
> that you could offer a vision of how interpreted languages and virtual
> machines can help manage complexity.

> In addition, most postprocessing tools are written in Matlab, an interpreted
> language. Running not-so-high-performance tasks efficiently on a workstation
> is sometimes as important as running a simulation on a 12000-node
> supercomputer. It would be nice if someone reminded the audience that Matlab
> is not the only suitable (or the best) tool for that job.

This is IMHO an issue of expressivity in Matlab vs. Python. There are
Python libraries for that, but Python lacks a domain-specific syntax:
you sometimes have to spell out method names instead of using
operators like / and ./. Slicing, though, already exists.
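
For instance, with NumPy (the usual array library for this; a minimal
sketch, with variable names of my own) the difference is mostly one of
spelling:

import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

c = a * b         # elementwise product, Matlab's a .* b
d = a / b         # elementwise division, Matlab's a ./ b
e = np.dot(a, b)  # matrix product, Matlab's a * b: a spelled-out name
f = a[1, :]       # the second row, much like Matlab's a(2, :)
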
However, JIT-compiled Python would have the huge advantage of making
plain loops efficient too, instead of forcing the user to rewrite
every loop in terms of whole-matrix operations. That rewriting was the
biggest slowdown in Matlab development I experienced; I remember
fixing it in somebody else's program and going from 12 hours to 1
minute of runtime, a 720x speedup. Interpreters can do much better
than unvectorized Matlab: even CPython is already faster in that area.

The advantage of vectorized operations is that they can easily use
optimized BLAS routines, possibly SSE-based or Cell-based ones.
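
To make the loop-rewriting point concrete, here is the kind of
transformation I mean (just a sketch, with NumPy assumed and the
function names mine):

import numpy as np

def scale_loop(x):
    # Loop style: natural in Fortran and fast under a JIT, but very
    # slow in a plain interpreter, which pays dispatch costs on every
    # iteration.
    y = np.empty(len(x))
    for i in range(len(x)):
        y[i] = 2.0 * x[i] + 1.0
    return y

def scale_vectorized(x):
    # Vectorized style: one expression executed by numpy's optimized
    # C loops; this is the manual rewrite that Matlab forces on you.
    return 2.0 * x + 1.0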

> I'm very interested in your comments.

Now, the first question is: do you need Python to be faster than
Matlab, or faster than Fortran? VMs can be faster than statically
compiled C++ (for instance, a static C++ compiler cannot inline
virtual method calls in general).
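
To make the inlining example concrete, a minimal sketch in Python (the
class and function names are mine):

class Square(object):
    def area(self, x):
        return x * x

def total_area(shape, xs):
    # shape.area is dynamically dispatched, so a static compiler
    # cannot inline it in general. A JIT that observes only Square
    # instances at this call site can inline the body, specialize the
    # loop, and fall back if another type ever shows up.
    total = 0.0
    for x in xs:
        total += shape.area(x)
    return total

print(total_area(Square(), range(1000)))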

That distinction makes a huge difference. Also, it is not clear to me
why one cannot keep using the existing Fortran source on Cell. OK, I
can wildly guess: with two specialized processor types, one might want
an automatic parallelizer that routes some operations to one and some
to the other? Can you give a better example?

The obvious candidates are automatic vectorization and automatic
parallelization, but I'm not aware of any VM-specific research on
those topics. And automatic parallelization is quite a difficult
problem anyway, as far as I know.

In other words: is there any special advantage of adaptive
optimization (even profile-guided optimization) that static optimizers
(like ICC's) cannot match? None is obvious for vectorization; for
auto-parallelization, automatic tuning of block sizes comes to mind.

In general, it is well known that the higher-level a language is, the
more information the compiler has available for optimization, but also
the more fancy features the language has that are not trivial to
optimize. Actually, Fortran is better than C exactly because it is
higher-level: Fortran array arguments are assumed not to alias, while
most C code effectively requires GCC's -fno-strict-aliasing, which
goes against the standard language semantics and forbids many
interesting optimizations.

See this page for details (note: I found the link elsewhere, and be
careful: according to Google and Firefox, the server has been
compromised and is spreading malware; I run Linux and am safe, YMMV):
http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html

Having said that, the point is to understand which of the
optimizations you currently perform by hand could be performed
automatically by a VM. Note that your Fortran compiler probably has a
far better static code optimizer: if you write plain Fortran-style
code in Python, it is going to be much slower (as if no optimization
were applied) until dataflow analysis, register allocation,
instruction scheduling and so on are implemented in PyPy, after all
the rest is finished. It is just a matter of implementation cost, but
that cost is huge.
Where VMs shine is in optimizations unavailable to static compilers,
namely adaptive ones. Inlining of virtual methods is one example;
automatic prefetching from memory into cache (by inserting SSE
prefetch instructions based on observed behavior), on which research
has been done, is another.
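
For example, a "plain Fortran in Python" kernel like the following (a
textbook triple loop, just a sketch) is exactly what a Fortran
compiler optimizes aggressively and what an interpreter executes with
full overhead on every single operation:

def matmul(a, b, c, n):
    # Classic matrix multiply, written as one would in Fortran.
    # A static compiler reorders loops, unrolls and allocates
    # registers; an interpreter boxes every number and dispatches
    # dynamically at each step instead.
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s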

Regards
-- 
Paolo Giarrusso


