Improving Python+MPI import performance
Hi all,

(I originally posted this to the BayPIGgies list, where Fernando Perez suggested I send it to the NumPy list as well. My apologies if you're receiving this email twice.)

I work on a Python/C++ scientific code that runs as a number of independent Python processes communicating via MPI. Unfortunately, as some of you may have experienced, module importing does not scale well in Python/MPI applications. For 32k processes on BlueGene/P, importing 100 trivial C-extension modules takes 5.5 hours, compared to 35 minutes for all other interpreter loading and initialization. We developed a simple pure-Python module (based on knee.py, a hierarchical import example) that cuts the import time from 5.5 hours to 6 minutes.

The code is available here: https://github.com/langton/MPI_Import

Usage, implementation details, and limitations are described in a docstring at the beginning of the file (just after the mandatory legalese).

I've talked with a few people who've faced the same problem and heard about a variety of approaches, which range from putting all necessary files in one directory to hacking the interpreter itself so it distributes the module-loading over MPI. Last summer, I had a student intern try a few of these approaches. It turned out that the problem wasn't so much the simultaneous module loads, but rather the huge number of failed open() calls (ENOENT) as the interpreter tries to find the module files. In the MPI_Import module, we have rank 0 perform the module lookups and then broadcast the locations to the rest of the processes. For our real-world scientific applications written in Python and C++, this has meant that we can start a problem and actually make computational progress before the batch allocation ends.

If you try out the code, I'd appreciate any feedback you have: performance results, bugfixes/feature-additions, or alternate approaches to solving this problem. Thanks!

-Asher
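A minimal sketch of the rank-0 lookup-and-broadcast idea follows. This is NOT the actual MPI_Import implementation (which handles packages, the import hook itself, and error cases); it only illustrates the core trick, assuming mpi4py is available:

    import imp
    from mpi4py import MPI

    def bcast_find_module(name, path=None, comm=MPI.COMM_WORLD):
        """Only rank 0 searches the filesystem; the location is broadcast."""
        if comm.Get_rank() == 0:
            fp, pathname, description = imp.find_module(name, path)
            if fp is not None:
                fp.close()  # only the location matters here
            found = (pathname, description)
        else:
            found = None  # other ranks issue no failed open() calls at all
        return comm.bcast(found, root=0)

Each rank can then reopen the broadcast pathname locally and hand it to imp.load_module, so the long run of failed open() calls happens only on rank 0.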
On 13.01.2012 02:13, Asher Langton wrote:
intern try a few of these approaches. It turned out that the problem wasn't so much the simultaneous module loads, but rather the huge number of failed open() calls (ENOENT) as the interpreter tries to find the module files.
It sounds like there is a scalability problem with imp.find_module. I'd report this on python-dev or python-ideas. Sturla
On Fri, Jan 13, 2012 at 19:41, Sturla Molden <sturla@molden.no> wrote:
On 13.01.2012 02:13, Asher Langton wrote:
intern try a few of these approaches. It turned out that the problem wasn't so much the simultaneous module loads, but rather the huge number of failed open() calls (ENOENT) as the interpreter tries to find the module files.
It sounds like there is a scalability problem with imp.find_module. I'd report this on python-dev or python-ideas.
It's well-known. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 01/13/2012 02:13 AM, Asher Langton wrote:
Hi all,
(I originally posted this to the BayPIGgies list, where Fernando Perez suggested I send it to the NumPy list as well. My apologies if you're receiving this email twice.)
I work on a Python/C++ scientific code that runs as a number of independent Python processes communicating via MPI. Unfortunately, as some of you may have experienced, module importing does not scale well in Python/MPI applications. For 32k processes on BlueGene/P, importing 100 trivial C-extension modules takes 5.5 hours, compared to 35 minutes for all other interpreter loading and initialization. We developed a simple pure-Python module (based on knee.py, a hierarchical import example) that cuts the import time from 5.5 hours to 6 minutes.
The code is available here:
https://github.com/langton/MPI_Import
Usage, implementation details, and limitations are described in a docstring at the beginning of the file (just after the mandatory legalese).
I've talked with a few people who've faced the same problem and heard about a variety of approaches, which range from putting all necessary files in one directory to hacking the interpreter itself so it distributes the module-loading over MPI. Last summer, I had a student intern try a few of these approaches. It turned out that the problem wasn't so much the simultaneous module loads, but rather the huge number of failed open() calls (ENOENT) as the interpreter tries to find the module files. In the MPI_Import module, we have rank 0 perform the module lookups and then broadcast the locations to the rest of the processes. For our real-world scientific applications written in Python and C++, this has meant that we can start a problem and actually make computational progress before the batch allocation ends.
This is great news! I've forwarded it to the mpi4py mailing list, which despairs over this regularly.

Another idea: given your diagnostics, wouldn't dumping the output of "find" for every path in sys.path to a single text file work well? Then each node downloads that file once and consults it when looking for modules, instead of hitting network file metadata. (In fact, I think "texhash" does the same for LaTeX?)

The disadvantage is that one would need to run "update-python-paths" every time a package is installed to update the text file. But I'm not sure that disadvantage is larger than remembering to avoid diverging import paths between nodes; hopefully one could put a reminder to run update-python-paths in the ImportError string.
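A rough sketch of what such an "update-python-paths" script could look like (the script and index-file names are hypothetical; it just dumps one absolute path per line for everything reachable from sys.path):

    import os
    import sys

    INDEX_FILE = os.path.expanduser('~/.python-path-index')  # assumed location

    def update_python_paths():
        # Walk every directory on sys.path once and dump the listing, so
        # importing processes can consult this file instead of stat'ing the
        # (network) filesystem on every import attempt.
        with open(INDEX_FILE, 'w') as index:
            for top in sys.path:
                if not os.path.isdir(top):
                    continue
                for dirpath, dirnames, filenames in os.walk(top):
                    for fname in filenames:
                        index.write(os.path.join(dirpath, fname) + '\n')

    if __name__ == '__main__':
        update_python_paths()

An importing process could then load this file into a set or dict once and check it before ever touching the filesystem.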
If you try out the code, I'd appreciate any feedback you have: performance results, bugfixes/feature-additions, or alternate approaches to solving this problem. Thanks!
I didn't try it myself, but forwarding this from the mpi4py mailing list: """ I'm testing it now and actually running into some funny errors with unittest on Python 2.7 causing infinite recursion. If anyone is able to get this going, and could report successes back to the group, that would be very helpful. """ Dag Sverre
On 01/13/2012 09:19 PM, Dag Sverre Seljebotn wrote:
On 01/13/2012 02:13 AM, Asher Langton wrote:
Hi all,
(I originally posted this to the BayPIGgies list, where Fernando Perez suggested I send it to the NumPy list as well. My apologies if you're receiving this email twice.)
I work on a Python/C++ scientific code that runs as a number of independent Python processes communicating via MPI. Unfortunately, as some of you may have experienced, module importing does not scale well in Python/MPI applications. For 32k processes on BlueGene/P, importing 100 trivial C-extension modules takes 5.5 hours, compared to 35 minutes for all other interpreter loading and initialization. We developed a simple pure-Python module (based on knee.py, a hierarchical import example) that cuts the import time from 5.5 hours to 6 minutes.
The code is available here:
https://github.com/langton/MPI_Import
Usage, implementation details, and limitations are described in a docstring at the beginning of the file (just after the mandatory legalese).
I've talked with a few people who've faced the same problem and heard about a variety of approaches, which range from putting all necessary files in one directory to hacking the interpreter itself so it distributes the module-loading over MPI. Last summer, I had a student intern try a few of these approaches. It turned out that the problem wasn't so much the simultaneous module loads, but rather the huge number of failed open() calls (ENOENT) as the interpreter tries to find the module files. In the MPI_Import module, we have rank 0 perform the module lookups and then broadcast the locations to the rest of the processes. For our real-world scientific applications written in Python and C++, this has meant that we can start a problem and actually make computational progress before the batch allocation ends.
This is great news! I've forwarded it to the mpi4py mailing list, which despairs over this regularly.

Another idea: given your diagnostics, wouldn't dumping the output of "find" for every path in sys.path to a single text file work well? Then each node downloads that file once and consults it when looking for modules, instead of hitting network file metadata.

(In fact, I think "texhash" does the same for LaTeX?)

The disadvantage is that one would need to run "update-python-paths" every time a package is installed to update the text file. But I'm not sure that disadvantage is larger than remembering to avoid diverging import paths between nodes; hopefully one could put a reminder to run update-python-paths in the ImportError string.
I meant "diverging code paths during imports between nodes".. Dag
If you try out the code, I'd appreciate any feedback you have: performance results, bugfixes/feature-additions, or alternate approaches to solving this problem. Thanks!
I didn't try it myself, but forwarding this from the mpi4py mailing list:
""" I'm testing it now and actually running into some funny errors with unittest on Python 2.7 causing infinite recursion. If anyone is able to get this going, and could report successes back to the group, that would be very helpful. """
Dag Sverre
On 13.01.2012 21:21, Dag Sverre Seljebotn wrote:
Another idea: Given your diagnostics, wouldn't dumping the output of "find" of every path in sys.path to a single text file work well?
It probably would, and would also be less prone to synchronization problems than using an MPI broadcast. Another possibility would be to use a bsddb (or sqlite?) file as a persistent dict for caching the output of imp.find_module. Sturla
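A small sketch of that caching idea, using shelve as the persistent on-disk dict (the file name and helper are made up, and a real version would need to invalidate the cache when packages change):

    import imp
    import shelve

    CACHE_FILE = 'find_module_cache'  # assumed location, e.g. on a local disk

    def cached_find_module(name, path=None):
        """Like imp.find_module, but remember (pathname, description) on disk."""
        cache = shelve.open(CACHE_FILE)
        try:
            key = '%s|%r' % (name, path)
            if key in cache:
                pathname, description = cache[key]
            else:
                fp, pathname, description = imp.find_module(name, path)
                if fp is not None:
                    fp.close()
                cache[key] = (pathname, description)
            return pathname, description
        finally:
            cache.close()

On a cluster, the cache file could live on node-local storage (or a ramdisk), so only the very first lookup of each module ever touches the network filesystem.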
On 1/13/12 12:38 PM, Sturla Molden wrote:
On 13.01.2012 21:21, Dag Sverre Seljebotn wrote:
Another idea: Given your diagnostics, wouldn't dumping the output of "find" of every path in sys.path to a single text file work well?
It probably would, and would also be less prone to synchronization problems than using an MPI broadcast. Another possibility would be to use a bsddb (or sqlite?) file as a persistent dict for caching the output of imp.find_module.
We tested something along those lines. Tim Kadich, a summer student at LLNL, wrote a module that went through the path and built up a dict of module->location mappings for a subset of module types. My recollection is that it worked well, and as you note, it didn't have the synchronization issues that MPI_Import has. We didn't fully implement it, since to handle complicated packages correctly, it looked like we'd either have to re-implement a lot of the internal Python import code or modify the interpreter itself.

I don't think that MPI_Import is ultimately the "right" solution, but it shows how easily we can reap significant gains. Two better approaches that come to mind are:

1) Fixing this bottleneck at the interpreter level (pre-computing and caching the locations)

2) More generally, dealing with this as well as other library-loading issues at the system level, perhaps by putting a small disk near a node or small collection of nodes, along with a command to push (broadcast) some portions of the filesystem to these (more-)local disks. Basically, the idea would be to let the user specify those directories or objects that will be accessed by most of the processes and treated as read-only, so that those objects can be cached near the node.

-Asher
On Fri, Jan 13, 2012 at 21:20, Langton, Asher <langton2@llnl.gov> wrote:
2) More generally, dealing with this as well as other library-loading issues at the system level, perhaps by putting a small disk near a node or small collection of nodes, along with a command to push (broadcast) some portions of the filesystem to these (more-)local disks. Basically, the idea would be to let the user specify those directories or objects that will be accessed by most of the processes and treated as read-only so that those objects can be cached near the node.
Do these systems have a ramdisk capability? -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 13.01.2012 22:24, Robert Kern wrote:
Do these systems have a ramdisk capability?
I assume you have seen this as well :) http://www.cs.uoregon.edu/Research/paracomp/papers/iccs11/iccs_paper_final.p... Sturla
On Fri, Jan 13, 2012 at 21:42, Sturla Molden <sturla@molden.no> wrote:
On 13.01.2012 22:24, Robert Kern wrote:
Do these systems have a ramdisk capability?
I assume you have seen this as well :)
http://www.cs.uoregon.edu/Research/paracomp/papers/iccs11/iccs_paper_final.p...
I hadn't, actually! Good find! Actually, this same problem came up at the last SciPy conference from several people (Blue Genes are more common than I expected!), and the ramdisk was just my first idea. I'm glad people have evaluated it. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 13.01.2012 22:42, Sturla Molden wrote:
On 13.01.2012 22:24, Robert Kern wrote:
Do these systems have a ramdisk capability? I assume you have seen this as well :)
http://www.cs.uoregon.edu/Research/paracomp/papers/iccs11/iccs_paper_final.p...
This paper also repeats a common mistake about the GIL:

"A future challenge is the increasing number of CPU cores per node, which is normally addressed by hybrid thread and message passing based parallelization. Whereas message passing can be used transparently by both on Python and C level, the global interpreter lock in CPython limits the thread based parallelization to the C-extensions only. We are currently investigating hybrid OpenMP/MPI implementation with the hope that limiting threading to only C-extension provides enough performance."

This is NOT true.

Python threads are native OS threads. They can be used for parallel computing on multi-core CPUs. The only requirement is that the Python code calls a C extension that releases the GIL. We can use threads in C or Python code: OpenMP and threading.Thread perform equally well, but if we use threading.Thread the GIL must be released for parallel execution. OpenMP is typically better for fine-grained parallelism in C code, and threading.Thread is better for coarse-grained parallelism in Python code. The latter is also where mpi4py and multiprocessing can be used.

Sturla
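As a small illustration of the point (hashlib documents that it releases the GIL while hashing buffers larger than a couple of kilobytes, so the C-level work below can run on several cores at once with plain threading.Thread):

    import hashlib
    import threading

    def hash_chunk(chunk, out, i):
        # The SHA-512 computation runs in C with the GIL released for large
        # inputs, so these threads really execute in parallel.
        out[i] = hashlib.sha512(chunk).hexdigest()

    chunks = [b'x' * (64 * 1024 * 1024) for _ in range(4)]
    results = [None] * len(chunks)
    threads = [threading.Thread(target=hash_chunk, args=(c, results, i))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()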
On 01/14/2012 12:28 AM, Sturla Molden wrote:
On 13.01.2012 22:42, Sturla Molden wrote:
On 13.01.2012 22:24, Robert Kern wrote:
Do these systems have a ramdisk capability? I assume you have seen this as well :)
http://www.cs.uoregon.edu/Research/paracomp/papers/iccs11/iccs_paper_final.p...
This paper also repeats a common mistake about the GIL:
"A future challenge is the increasing number of CPU cores per node, which is normally addressed by hybrid thread and message passing based parallelization. Whereas message passing can be used transparently by both on Python and C level, the global interpreter lock in CPython limits the thread based parallelization to the C-extensions only. We are currently investigating hybrid OpenMP/MPI implementation with the hope that limiting threading to only C-extension provides enough performance."
This is NOT true.
Python threads are native OS threads. They can be used for parallel computing on multi-core CPUs. The only requirement is that the Python code calls a C extension that releases the GIL. We can use threads in C or Python code: OpenMP and threading.Thread perform equally well, but if we use threading.Thread the GIL must be released for parallel execution. OpenMP is typically better for fine-grained parallelism in C code, and threading.Thread is better for coarse-grained parallelism in Python code. The latter is also where mpi4py and multiprocessing can be used.
I don't see how you contradict their statement. The only code that can run without the GIL is in C-extensions (even if it is written in, say, Cython). Dag Sverre
On 1/13/12 1:24 PM, Robert Kern wrote:
On Fri, Jan 13, 2012 at 21:20, Langton, Asher <langton2@llnl.gov> wrote:
2) More generally, dealing with this as well as other library-loading issues at the system level, perhaps by putting a small disk near a node or small collection of nodes, along with a command to push (broadcast) some portions of the filesystem to these (more-)local disks. Basically, the idea would be to let the user specify those directories or objects that will be accessed by most of the processes and treated as read-only so that those objects can be cached near the node.
Do these systems have a ramdisk capability?
That was another thing we looked at (but didn't implement): broadcasting the modules to each node and putting them in a ramdisk. The drawback (for us) is that we're already struggling with the amount of available memory per core, and according to the vendors, the situation will only get worse on future systems. The ramdisk approach might work well when there are lots of small objects that will be accessed.

On 1/13/12 1:42 PM, Sturla Molden wrote:
On 13.01.2012 22:24, Robert Kern wrote:
Do these systems have a ramdisk capability?
I assume you have seen this as well :)
http://www.cs.uoregon.edu/Research/paracomp/papers/iccs11/iccs_paper_final.pdf
I hadn't. Thanks! -Asher
On 01/13/2012 10:20 PM, Langton, Asher wrote:
On 1/13/12 12:38 PM, Sturla Molden wrote:
On 13.01.2012 21:21, Dag Sverre Seljebotn wrote:
Another idea: Given your diagnostics, wouldn't dumping the output of "find" of every path in sys.path to a single text file work well?
It probably would, and would also be less prone to synchronization problems than using an MPI broadcast. Another possibility would be to use a bsddb (or sqlite?) file as a persistent dict for caching the output of imp.find_module.
We tested something along those lines. Tim Kadich, a summer student at LLNL, wrote a module that went through the path and built up a dict of module->location mappings for a subset of module types. My recollection is that it worked well, and as you note, it didn't have the synchronization issues that MPI_Import has. We didn't fully implement it, since to handle complicated packages correctly, it looked like we'd either have to re-implement a lot of the internal Python import code or modify the interpreter itself. I don't think that MPI_Import is ultimately the "right" solution, but it shows how easily we can reap significant gains. Two better approaches that come to mind are:
It's actually not too difficult to do something like

LD_PRELOAD=myhack.so python something.py

and have myhack.so intercept the filesystem calls Python makes (to libc) and do whatever it wants. That's a solution that doesn't interfere with how Python does its imports at all; it simply changes how Python perceives the world around it ("emulation", though much, much lighter). It does require some low-level C code, but there are several examples on the net. I know Ondrej Certik just implemented something similar.

Note, I'm just brainstorming here and recording possible (and perhaps impossible) ideas in this thread -- the solution you have found is indeed a great step forward!

Dag Sverre
1) Fixing this bottleneck at the interpreter level (pre-computing and caching the locations)
2) More generally, dealing with this as well as other library-loading issues at the system level, perhaps by putting a small disk near a node or small collection of nodes, along with a command to push (broadcast) some portions of the filesystem to these (more-)local disks. Basically, the idea would be to let the user specify those directories or objects that will be accessed by most of the processes and treated as read-only so that those objects can be cached near the node.
-Asher
On 1/13/12 1:58 PM, Dag Sverre Seljebotn wrote:
It's actually not too difficult to do something like
LD_PRELOAD=myhack.so python something.py
and have myhack.so intercept the filesystem calls Python makes (to libc) and do whatever it wants. That's a solution that doesn't interfere with how Python does its imports at all; it simply changes how Python perceives the world around it ("emulation", though much, much lighter).
It does require some low-level C code, but there are several examples on the net. I know Ondrej Certik just implemented something similar.
One of my colleagues suggested the LD_PRELOAD trick. I asked around here at LLNL, and I seem to recall hearing that the LD_PRELOAD trick didn't work on BlueGene/P, which is where the import bottleneck is the worst. That might have been incorrect though, since LD_PRELOAD is mentioned on Argonne's BG/P wiki. I'll have to look into this some more. -Asher
It is a straightforward thing to implement a "registry mechanism" for Python that bypasses imp.find_module (i.e. using sys.meta_path). You could imagine creating the registry file for a package or distribution (much like Dag described) and pushing that to every node during distribution. The registry file would hold the map between package_name : file_location, which would avoid all the failed open calls. You would need to keep the registry updated as Dag describes, but this seems like a fairly simple approach that should help; a rough sketch of such a finder follows the quoted message below.

-Travis

On Jan 13, 2012, at 2:38 PM, Sturla Molden wrote:
On 13.01.2012 21:21, Dag Sverre Seljebotn wrote:
Another idea: Given your diagnostics, wouldn't dumping the output of "find" of every path in sys.path to a single text file work well?
It probably would, and would also be less prone to synchronization problems than using an MPI broadcast. Another possibility would be to use a bsddb (or sqlite?) file as a persistent dict for caching the output of imp.find_module.
Sturla
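A rough sketch of the registry-based sys.meta_path finder Travis describes above (hypothetical names; it handles only top-level pure-Python modules, and the registry dict itself would be built and distributed separately, e.g. from the index file discussed earlier):

    import imp
    import sys

    class RegistryFinder(object):
        """sys.meta_path hook that answers imports from a pre-built map."""

        def __init__(self, registry):
            self.registry = registry  # e.g. {'foo': '/sw/lib/python/foo.py'}

        def find_module(self, fullname, path=None):
            # No filesystem search at all: either the registry knows the
            # module, or we defer to the normal import machinery.
            return self if fullname in self.registry else None

        def load_module(self, fullname):
            if fullname in sys.modules:
                return sys.modules[fullname]
            pathname = self.registry[fullname]
            with open(pathname, 'U') as fp:
                return imp.load_module(fullname, fp, pathname,
                                       ('.py', 'U', imp.PY_SOURCE))

    # Installed ahead of the normal import machinery, e.g.:
    # sys.meta_path.insert(0, RegistryFinder(registry))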
participants (6)
- Asher Langton
- Dag Sverre Seljebotn
- Langton, Asher
- Robert Kern
- Sturla Molden
- Travis Oliphant