[Baypiggies] Improving Python+MPI import performance

Asher Langton langton2 at llnl.gov
Thu Jan 12 21:23:20 CET 2012


Hi all,

I work on a Python/C++ scientific code that runs as a number of
independent Python processes communicating via MPI. Unfortunately, as
some of you may have experienced, module importing does not scale well
in Python/MPI applications. For 32k processes on BlueGene/P, importing
100 trivial C-extension modules takes 5.5 hours, compared to 35
minutes for all other interpreter loading and initialization. We
developed a simple pure-Python module (based on knee.py, a
hierarchical import example) that cuts the import time from 5.5 hours
to 6 minutes.

The code is available here:

https://github.com/langton/MPI_Import

Usage, implementation details, and limitations are described in a
docstring at the beginning of the file (just after the mandatory
legalese).

I've talked with a few people who've faced the same problem and heard
about a variety of approaches, ranging from putting all of the
necessary files in a single directory to hacking the interpreter
itself so that it distributes module loading over MPI. Last summer, I
had a student
intern try a few of these approaches. It turned out that the problem
wasn't so much the simultaneous module loads as the huge number of
failed open() calls (ENOENT) issued while the interpreter searches for
the module files. In the MPI_Import module, we have rank 0
perform the module lookups and then broadcast the locations to the
rest of the processes. For our real-world scientific applications
written in Python and C++, this has meant that we can start a problem
and actually make computational progress before the batch allocation
ends.
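
For readers who want the flavor of the approach without digging into
the repository, here is a minimal sketch of the rank-0 lookup and
broadcast idea. It is not the MPI_Import code (which is built on
knee.py and the older import machinery); it uses mpi4py and a modern
importlib meta-path finder purely for illustration, the
BroadcastFinder name is made up, and it assumes every rank performs
the same imports in the same order so the collective calls match up.

    import importlib.machinery
    import importlib.util
    import sys

    from mpi4py import MPI


    class BroadcastFinder:
        """Meta-path finder: rank 0 searches sys.path for the module,
        then broadcasts the location so the other ranks can skip the
        failed open() calls entirely."""

        def __init__(self, comm=MPI.COMM_WORLD):
            self.comm = comm

        def find_spec(self, name, path=None, target=None):
            if self.comm.rank == 0:
                # Only rank 0 actually touches the filesystem.
                spec = importlib.machinery.PathFinder.find_spec(name, path)
                origin = spec.origin if spec is not None else None
            else:
                origin = None
            # Collective call: every rank must reach this point for every
            # import, otherwise the broadcast deadlocks.
            origin = self.comm.bcast(origin, root=0)
            if origin is None:
                return None  # fall through to the normal import machinery
            # All ranks build a spec directly from the broadcast location.
            return importlib.util.spec_from_file_location(name, origin)


    sys.meta_path.insert(0, BroadcastFinder())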

If you try out the code, I'd appreciate any feedback you have:
performance results, bug fixes, feature additions, or alternate
approaches to solving this problem. Thanks!

-Asher

