Hi all,

I'm still trying to run HOP on the 1024^3 dataset on Ranger. It's been dying in two places, inconsistently, meaning two consecutive runs will die in either place. Line 196 in BaseDataTypes.py:

self.fields.append(key)

where it complains that self.fields doesn't have append. And line 92 in HierarchyType.py:

self.gridTree = [ [] for i in range(self.num_grids)]

where it sometimes gives a memory error. self.num_grids is 440,000+, but in my testing it doesn't matter how many lists it's trying to make; if that's where it wants to die, it dies.

I'm running on Kraken right now. We'll see how it goes.

_______________________________________________________
Stephen Skory, Graduate Student
sskory@physics.ucsd.edu
http://physics.ucsd.edu/~sskory/
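For scale, the gridTree line itself is cheap to build; a quick check with the standard library (Python 2.6 or newer, using the 440,000+ figure above) suggests the skeleton of empty lists costs only a few tens of MB, so a MemoryError on that line means memory was already nearly exhausted before it ran:

    import sys

    num_grids = 440000  # roughly the figure quoted above
    grid_tree = [[] for i in range(num_grids)]
    # outer list object plus one empty inner list per grid
    total_bytes = sys.getsizeof(grid_tree) + sum(sys.getsizeof(l) for l in grid_tree)
    print "gridTree skeleton: %.1f MB" % (total_bytes / 1024.0 ** 2)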
The second place should occur very early on in the instantiation of the hierarchy; you should be able to generate that error with just:

pf = load("my_data")
pf.h

and it will die. Is that in fact the case?

There are a couple of things we can do to reduce the memory overhead. A while back I was working on them, and the nascent efforts are in yt-hierarchy-opt on hg.enzotools.org.

The first problem you cite is still odd to me.
The second place should occur very early on in the instantiation of the hierarchy; you should be able to generate that error with just:
pf = load("my_data")
pf.h
and it will die. Is that in fact the case?
No, it didn't die when I tried this. In related news, my attempt on Kraken just died, with an unhelpful error message. But it did get farther along than on Ranger; some of the threads started running HOP. On Ranger it has never gotten that far.

I'll keep plugging away at this. Thanks!
I'll keep plugging away at this. Thanks!
There are several debugging techniques we need to try. I would recommend you instantiate the hierarchy interactively and examine the RAM in use. Load a single tile with varying sizes based on the number of processors, and see how many fields you can load before it dies. You should additionally consider using guppy to debug the memory. I believe fixing this problem will require a hands-on approach: until we can figure out where the memory is going, I don't think we can really fix it.

Your statement that it did not die during hierarchy instantiation surprises me. Previously, you said it died here:

self.gridTree = [ [] for i in range(self.num_grids)]

This happens during hierarchy instantiation. Can you please send the entire traceback from this crash? If it did NOT occur during hierarchy instantiation, but instead somewhere else, then we have an interesting problem.

-Matt
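A minimal sketch of the interactive check being described, assuming guppy is installed and a yt of this vintage where "from yt.mods import *" provides load(); setrelheap() restricts the report to allocations made after the hierarchy starts building:

    from guppy import hpy
    from yt.mods import *   # assumed entry point for load() in 2009-era yt

    hp = hpy()
    hp.setrelheap()          # only count allocations made from this point on

    pf = load("DD0082")      # dataset name taken from later in the thread
    pf.h                     # force hierarchy instantiation
    print hp.heap()          # per-type breakdown of what the hierarchy allocated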
There are several debugging techniques we need to try. I would recommend you instantiate the hierarchy interactively and examine the RAM in use.
I did this on the login node.
pf = load('DD0082')
pf.h
....
h.heap()
Partition of a set of 9002467 objects. Total size = 1204579016 bytes.
 Index   Count  %       Size  %  Cumulative  % Kind (class / dict of class)
     0  441317  5  462500216 38   462500216 38 dict of yt.lagos.HierarchyType.EnzoGrid
     1  883469 10  248309432 21   710809648 59 dict (no owner)
     2 2206122 25  176489760 15   887299408 74 numpy.ndarray
     3  884892 10  112789472  9  1000088880 83 list
     4  515901  6   46752208  4  1046841088 87 str
     5 1767567 20   42421608  4  1089262696 90 numpy.float64
     6  441319  5   38836072  3  1128098768 94 __builtin__.weakproxy
     7  441317  5   31774824  3  1159873592 96 yt.lagos.HierarchyType.EnzoGrid
     8  444668  5   10672032  1  1170545624 97 int
     9  441319  5   10591656  1  1181137280 98 numpy.int32
That's 1.2 GB, which is a fair amount to heft around per thread. I've done runs on Ranger and Kraken with up to 4 GB per thread, which I think should be sufficient for this data.
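For reference, a little arithmetic on the heapy numbers above (my own back-of-the-envelope, not from the thread):

    # Per-grid cost of the hierarchy, straight from the heapy report above.
    total_bytes = 1204579016
    n_grids = 441317
    enzogrid_dict_bytes = 462500216

    print "total hierarchy: %.2f GB" % (total_bytes / 1024.0 ** 3)        # ~1.12 GB
    print "per grid, all objects: %d bytes" % (total_bytes / n_grids)     # ~2729 bytes
    print "per grid, EnzoGrid __dict__ alone: %d bytes" % (enzogrid_dict_bytes / n_grids)  # ~1048 bytes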
Load a single tile with varying sizes based on the number of processors, and see how many fields you can load before it dies.
I'm not exactly sure what you mean by this. However, I have been trying this script:

http://paste.enzotools.org/show/121/

and it dies if RunHOP is turned on, but runs to completion if I comment it out. Some of the threads run RunHOP before the thing dies. Here are the error messages I get with RunHOP on:

http://paste.enzotools.org/show/122/

Those error messages aren't anything like what I've seen when doing a real run of HOP. However, the error messages in those runs have been so cryptic and inconsistent that I don't feel I can say any of these errors are the same thing. I can say that I ran the script above twice and got the exact same error messages, which is better than with the regular HOP run.

I just ran the script above on a very small dataset and it didn't crash, so I don't think there's anything inherently wrong with the script.
I'm not exactly sure what you mean by this. However, I have been trying this script:
http://paste.enzotools.org/show/121/
and it dies if RunHOP is turned on, but runs to completion if I comment it out. Some of the threads run RunHOP before the thing dies. Here are the error messages I get with RunHOP on:
How big is DD0082?

-Matt
I ran the script on Kraken, where it ran with 128 threads and 8 GB per thread. With 64 threads and 8 GB per thread on Kraken it gets stuck somewhere in HOP (no error messages) on one thread. That thread has the largest number of particles, so it's likely that it is running out of memory.
How big is DD0082?
It is 150 GB.
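As a rough sanity check on the memory budget, here is my own back-of-the-envelope estimate, assuming 1024^3 particles and 64-bit fields; the real per-thread footprint depends on how unevenly the particles are distributed across threads:

    # Particle arrays HOP works with: positions (x, y, z) and mass, plus roughly
    # a density and a group/chain array, so call it 6 doubles per particle.
    n_particles = 1024 ** 3
    doubles_per_particle = 6
    total_gb = n_particles * doubles_per_particle * 8 / 1024.0 ** 3
    print "total particle arrays: %.0f GB" % total_gb                 # ~48 GB
    print "average per thread, 64 threads: %.2f GB" % (total_gb / 64)
    # The average looks comfortable next to 8 GB per thread, but the most heavily
    # loaded thread sits well above it, on top of the ~1.2 GB hierarchy overhead
    # measured earlier in the thread.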
Okay, here's where I'm at right now. I believe we have encountered KeyboardInterrupt errors before when the script was being killed by either out-of-memory or memory-corruption issues. That would only be possible in the HOP code, which corresponds to what you are seeing when you comment out RunHOP.

Unfortunately, it's not easy for me to reproduce memory corruption here on such a large dataset. I am attempting to do so with the L7 RD0035 dataset, by running your script on four processors on one of our machines; unfortunately, all our multiprocessor machines also have lots of RAM, so I'm not sure I'll be able to get identical results, but I am trying.

Are you running with vanilla trunk, and which revision? I'm on vanilla trunk r1297.

-Matt
Hi Matt,
Unfortunately, it's not easy for me to reproduce memory corruption here on such a large dataset. I am attempting to do so with the L7 RD0035 dataset.
Let me know if you want me to make DD0082 publicly readable on either Ranger or Kraken. It's also fairly quick to move the data to any machine with NSF GridFTP that I have an account on.
Are you running with vanilla trunk, and which revision? I'm on vanilla trunk r1297.
It's r1295, so before all the h5py stuff, which I don't think should change any of this, with my modified HaloFinding that uses preloading:

http://paste.enzotools.org/show/123/

Everything else is stock r1295.
Hi Stephen,

I've committed a change in r1298 that should substantially speed up the process of reading in particles. I'm still working on reproducing the error.

-Matt
I have good but inconsistent news: I just got HOP to run for real on Kraken using 128 threads and 8 GB per thread. I could have sworn I've done jobs before with that much aggregate memory, but that may have been before I applied hop_numpy.h. Perhaps I just need to be really, really aggressive when it comes to memory usage.

I'll bump up to r1298 and see if I can reproduce a successful run. It will take a bit due to the CNL procedure.
Hi Stephen,

I'm hammering down on the memory usage now. There seems to be a bug somewhere in preloading that I am attempting to locate. We may be carrying around extra arrays with the new patch you sent.

-Matt
Matt,
I'm hammering down on the memory usage now. There seems to be a bug somewhere in preloading that I am attempting to locate. We may be carrying around extra arrays with the new patch you sent.
I've been suspicious of that, because the preloading runs aren't any faster than the vanilla ones. Shouldn't they be substantially faster, especially on Lustre, where the major hangup is with file opens?

Thanks for this, by the way.
Hi Stephen,

Remove all definitions of g_objs and replace g_objs in the call to _preload with self._data_source._grids. This is the fix to your patch; in r1300 I have made other necessary changes.

-Matt
Additionally, to track when it is not using preloaded fields (and you might want to preload creation_time if necessary), you can add a line inside the function yt/lagos/DataReadingFuncs.py:readDataPacked that prints out the grid id, the field, and says it's hitting the C code.

In a standard run of:

yt hop my_data_file --parallel

that function should never get called. If it is, we have not fixed it.

-Matt
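A hedged sketch of what that added line might look like; the signature of readDataPacked is assumed here (the grid as self plus a field name) and may not match r1300 exactly, so treat everything except the print itself as illustrative:

    # In yt/lagos/DataReadingFuncs.py (sketch; the existing packed-HDF5 read
    # is left out and continues unchanged below the print):
    def readDataPacked(self, field):
        # Any output from this line means the preload cache was bypassed and we
        # are falling through to the C reader for this grid/field pair.
        print "grid %s, field %s: hitting the C code" % (self.id, field)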
Matt,
Additionally, to track when it is not using preloaded fields (and you might want to preload creation_time if necessary) you can add a line inside the function yt/lagos/DataReadingFuncs.py:readDataPacked that prints out the grid id, the field, and says it's hitting the C code.
Thanks! I'll take the changes for a whirl.
participants (2)
- Matthew Turk
- Stephen Skory