Hey everyone,

I am trying to use Parallel HOP in yt to analyze Enzo data. I installed mpi4py and Forthon and then ran "python setup.py install". I then try to find halos with this script on 2 nodes with 16 processors each (32 total):

from yt.mods import *
from yt.analysis_modules.halo_finding.api import *

i = 5
filename = 'RD%04d/RedshiftOutput%04d' % (i, i)
pf = load(filename)
halos = parallelHF(pf)
dumpn = 'RD%04d/MergerHalos' % i
halos.dump(dumpn)

The output is rather long since all 32 processors write to it. The full output is here: http://paste.yt-project.org/show/2761/

However, here are some highlights:

$ mpirun -np 32 python findhalo.py --parallel
Reported: 2 (out of 2) daemons - 32 (out of 32) procs
yt : [INFO ] 2012-10-04 22:54:51,855 Global parallel computation enabled: 1 / 32
yt : [INFO ] 2012-10-04 22:54:51,855 Global parallel computation enabled: 21 / 32
....
yt : [INFO ] 2012-10-04 22:54:51,858 Global parallel computation enabled: 10 / 32
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          mu0002.localdomain (PID 9624)
  MPI_COMM_WORLD rank: 3

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
P000 yt : [INFO ] 2012-10-04 22:54:55,571 Parameters: current_time = 89.9505268216
P000 yt : [INFO ] 2012-10-04 22:54:55,571 Parameters: domain_dimensions = [1024 1024 1024]
P000 yt : [INFO ] 2012-10-04 22:54:55,572 Parameters: domain_left_edge = [ 0.  0.  0.]
P000 yt : [INFO ] 2012-10-04 22:54:55,572 Parameters: domain_right_edge = [ 1.  1.  1.]
P000 yt : [INFO ] 2012-10-04 22:54:55,573 Parameters: cosmological_simulation = 1
P000 yt : [INFO ] 2012-10-04 22:54:55,573 Parameters: current_redshift = 5.99999153008
P000 yt : [INFO ] 2012-10-04 22:54:55,573 Parameters: omega_lambda = 0.724
...
P000 yt : [INFO ] 2012-10-04 23:04:33,681 Getting particle_index using ParticleIO
P001 yt : [INFO ] 2012-10-04 23:05:09,222 Getting particle_index using ParticleIO
Traceback (most recent call last):
  File "findhalo.py", line 7, in <module>
    halos = parallelHF(pf)
  File "/usr/projects/magnetic/jsmidt/yt-x86_64/lib/python2.7/site-packages/yt-2.5dev-py2.7-linux-x86_64.egg/yt/analysis_modules/halo_finding/halo_objects.py", line 2268, in __init__
    premerge=premerge, tree=self.tree)
  File "/usr/projects/magnetic/jsmidt/yt-x86_64/lib/python2.7/site-packages/yt-2.5dev-py2.7-linux-x86_64.egg/yt/analysis_modules/halo_finding/halo_objects.py", line 1639, in __init__
    HaloList.__init__(self, data_source, dm_only)
  File "/usr/projects/magnetic/jsmidt/yt-x86_64/lib/python2.7/site-packages/yt-2.5dev-py2.7-linux-x86_64.egg/yt/analysis_modules/halo_finding/halo_objects.py", line 1067, in __init__
    self._run_finder()
  File "/usr/projects/magnetic/jsmidt/yt-x86_64/lib/python2.7/site-packages/yt-2.5dev-py2.7-linux-x86_64.egg/yt/analysis_modules/halo_finding/halo_objects.py", line 1648, in _run_finder
    if np.unique(self.particle_fields["particle_index"]).size != \
  File "/usr/projects/magnetic/jsmidt/yt-x86_64/lib/python2.7/site-packages/numpy/lib/arraysetops.py", line 193, in unique
    return ar[flag]
MemoryError
mpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 6295 on node mu0001.localdomain
exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
32 total processes killed (some possibly by mpirun during cleanup)

Anyway, if anyone recognizes this or has any advice, it would be appreciated. Thanks.

--
------------------------------------------------------------------------
Joseph Smidt <josephsmidt@gmail.com>
Theoretical Division
P.O. Box 1663, Mail Stop B283
Los Alamos, NM 87545
Office: 505-665-9752
Fax: 505-667-1931
Hi Joseph,

The last line, MemoryError, makes me suspect the machine is running out of memory. I'm just guessing, so it might not be the case at all. Can you tell us a little bit about the memory available on your machine (GB per core) and the number of particles in your simulation?

In my past experience with Parallel HOP, a safe guideline has been to have 1 MB of RAM per 5000 particles. yt has since been optimized further, so that number should be smaller now, but it is a safe place to start if you're having trouble. If you have 1 particle per cell, then 1024**3 / 5000 / 32 ~ 6711 MB, so you'll need about 7 GB per core when using 32 cores. If your machine has 4 GB per core, you might want to try 64 cores for the job.

Hope this helps.

From G.S.
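For reference, here is a rough back-of-the-envelope sketch of that estimate. It just applies the 1 MB per 5000 particles rule of thumb above and assumes one dark-matter particle per cell of a 1024^3 root grid, so treat the numbers as a guideline rather than an exact requirement:

# Rough Parallel HOP memory estimate (rule of thumb, not an exact requirement).
n_particles = 1024**3         # assumes roughly 1 particle per cell for a 1024^3 run
mb_per_particle = 1.0 / 5000  # guideline: ~1 MB of RAM per 5000 particles

total_mb = n_particles * mb_per_particle
for n_cores in (32, 64, 128):
    per_core_gb = total_mb / n_cores / 1024.0
    print("%4d cores: ~%.1f GB per core" % (n_cores, per_core_gb))

# 32 cores -> ~6.6 GB/core, 64 cores -> ~3.3 GB/core, 128 cores -> ~1.6 GB/core

If you do go to 64 cores, the invocation stays the same, just with more ranks, e.g. something like: mpirun -np 64 python findhalo.py --parallel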