(I'd send this to -users, but since we're not emphasizing parallel hop just yet, I'll put it here.)

I am trying to run hop on Ranger on a 1024^3 AMR dataset I got from Robert Harkness. I've been running it with a varying number of threads, always greater than 64. I'm pretty sure it's not a memory problem; I ssh-ed into the head node and ran 'top.' I think I'll try this on Kraken tomorrow, but for now, can any of you make heads or tails of the error message below?

AttributeError: 'list' object has no attribute 'append'

When can a list not have 'append' as an attribute?

http://paste.enzotools.org/show/112/

I can successfully run parallel hop on L7 with this exact same install of yt. The script I'm running is dead simple:

from yt.mods import *
pf = load("DD0082")
hop = HaloFinder(pf, padding=0.02)
hop.write_out(filename="benchmark-hop.out")

Britton, I know you've been doing some large-scale stuff lately. Have you run hop on something this large?

Thanks!

--
Stephen Skory
sskory@physics.ucsd.edu
http://physics.ucsd.edu/~sskory/
Hi Stephen,

I think this is the right location for this -- I believe it's a bug, and we should be addressing it as such. Are you using the old version of HOP, without the patch I constructed to do single-array addressing of particles?

Can you tell me a bit more about the results of running top? How much free memory was there? Is there any reason to believe that the distribution of particles is uneven enough that another node's RAM could be getting blown out?

As far as it being a bug, I honestly don't know -- lists should always have an append method. The only cause I can think of would be some problem in the C backend manifesting in this way. I'll ask around.

It's unclear to me if this is related, but I believe that HOP, right now, is too memory intensive. I think we can fix this, but it will require a bit of thought. Britton has been running into the same problems that you have. I'll see if I can profile the memory and get an idea of what's going on.

Please let us know what happens on Kraken. This is unacceptable and we need to fix it.

-Matt
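(For reference on the "when can a list not have append" puzzle: genuine Python lists always have an append method, so if the message is real, the object's type is merely *named* 'list' -- exactly what a misbehaving C-extension type could produce. A purely illustrative sketch, not a claim about what HOP's backend is actually doing:)

FakeList = type('list', (tuple,), {})   # a type named 'list', but built on tuple
try:
    FakeList((1, 2, 3)).append(4)       # tuples have no append...
except AttributeError as e:
    print(e)                            # 'list' object has no attribute 'append'

A C extension whose tp_name is set to "list" would raise the same message.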
Matt,
> I think this is the right location for this -- I believe it's a bug, and we should be addressing it as such. Are you using the old version of HOP, without the patch I constructed to do single-array addressing of particles?
I am not, but it shouldn't matter. Judging from the output logs, it's crashing after the first read of particles for the unpadded regions, but before HOP gets called by any thread -- specifically, while reading in the particle_position_* fields for the first time.
> Can you tell me a bit more about the results of running top? How much free memory was there? Is there any reason to believe that the distribution of particles is uneven enough that another node's RAM could be getting blown out?
I ran top when I was doing one thread per node with 64 threads. The python process itself maxed out at nearly 20% of the machine before it crashed. The memory-used line for the whole node showed quite a bit more, nearly half the node. I'd be surprised if an uneven distribution of particles was blowing out another node, but I'm not certain. I've also run with up to 256 threads, two per node, which should give each process roughly twice the memory headroom of the run I watched with top, and I got the same error message at about the same place.
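(Back-of-envelope on those two runs, assuming the nominal 32 GB per Ranger node -- the node size is an assumption here, not something stated in the thread:)

node_gb = 32.0
particles = 1024 ** 3
for procs, per_node in [(64, 1), (256, 2)]:
    mem_per_proc = node_gb / per_node      # GB visible to each process
    parts_per_proc = particles / procs     # particles each process handles
    print("%3d procs: %.0f bytes of RAM per particle"
          % (procs, mem_per_proc * 2 ** 30 / parts_per_proc))
# 64 procs: 2048 bytes/particle; 256 procs: 4096 -- the 2x headroom mentioned above.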
> Please let us know what happens on Kraken. This is unacceptable and we need to fix it.
I'll let you know.
> I am not, but it shouldn't matter. Judging from the output logs, it's crashing after the first read of particles for the unpadded regions, but before HOP gets called by any thread -- specifically, while reading in the particle_position_* fields for the first time.
I'd say run a test problem. Write a script that partitions the hierarchy. Read in a single position field. Then, copy it a few times (my_array.copy()) to see if it dies. I have never seen the error you are seeing, which is why I am inclined to think it's a memory issue. But I'm not sure.
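(Something along these lines, perhaps -- a minimal sketch, with the caveat that "all_data" below is a stand-in I am assuming for "a region covering this processor's piece of the domain"; the real test should call whatever hierarchy-partitioning routine HaloFinder itself uses:)

from yt.mods import *

pf = load("DD0082")
dd = pf.h.all_data()                       # assumed stand-in for the partitioned region

pos = dd["particle_position_x"]            # the read that seems to trigger the crash
copies = [pos.copy() for i in range(4)]    # duplicate it a few times, per Matt
print("copied %d particles %d times without dying" % (pos.size, len(copies)))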
> I ran top when I was doing one thread per node with 64 threads. The python process itself maxed out at nearly 20% of the machine before it crashed. The memory-used line for the whole node showed quite a bit more, nearly half the node. I'd be surprised if an uneven distribution of particles was blowing out another node, but I'm not certain. I've also run with up to 256 threads, two per node, which should give each process roughly twice the memory headroom of the run I watched with top, and I got the same error message at about the same place.
Okay, that's very good to know. So maybe it's not a memory issue -- but the specific error is unclear to me. Since we can localize it to that point, maybe you should add a dir() call on the object, as well as some print statements, to see where it gets. -Matt
Matt,
> I'd say run a test problem. Write a script that partitions the hierarchy. Read in a single position field. Then, copy it a few times (my_array.copy()) to see if it dies. I have never seen the error you are seeing, which is why I am inclined to think it's a memory issue. But I'm not sure.
Is this what you had in mind? http://paste.enzotools.org/show/113/

This ran just fine using 32 threads on 16 nodes on Ranger. The two threads I watched with top maxed out at about 20% of the node each, with roughly half of the system memory shown as 'used.'
> Since we can localize it to that point, maybe you should add a dir() call on the object, as well as some print statements, to see where it gets.
dir() on which object? The hoplist? Or self.data_source in HaloFinding?
Hi Stephen,
> Is this what you had in mind? http://paste.enzotools.org/show/113/
Sure is.
> This ran just fine using 32 threads on 16 nodes on Ranger. The two threads I watched with top maxed out at about 20% of the node each, with roughly half of the system memory shown as 'used.'
Okay, so that's roughly what we saw before. So it's probably not a memory issue.
> dir() on which object? The hoplist? Or self.data_source in HaloFinding?
Sorry -- dir() on the so-called 'list' object that is throwing the error. I'd say dig in there and see if it's a Python error. If 'append' shows up in the dir() but the object still throws an AttributeError, then either there's a bug in the interpreter or yt is somehow propagating a bug upwards. -Matt
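(Concretely, something like this dropped in at the crash site -- 'interrogate' is a hypothetical helper name, and the object passed in should be whatever the traceback blames:)

def interrogate(obj):
    # Enough to tell a genuine builtin list from an impostor type
    # that merely calls itself 'list'.
    print(type(obj))                 # really the builtin list?
    print(type(obj).__mro__)        # whose class hierarchy is this?
    print('append' in dir(obj))     # does attribute lookup actually find append?

interrogate([])   # sanity check on a real list: the dir() test prints True

If append shows up in dir() but obj.append still raises, the interpreter (or an extension module) is the suspect.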
Sorry to chime in so late. I ran hop on Ranger on a 1024^3 unigrid about a week or so ago. I had to give it a relatively large amount of RAM to make it go, but it did eventually work. When it failed, though, I was not getting that error -- I believe it was something that specifically mentioned being out of RAM.

Unfortunately, I have to leave for the airport now, so I don't have time to add anything more useful at the moment. I will come back later and try to take a closer look.

Britton
participants (3)

- Britton Smith
- Matthew Turk
- Stephen Skory