I'm breaking Ranger...
before I reply, do any of you have something intelligent to add that I can say? Because the .cpu files aren't spatially restricted, each thread of parallel HOP may have to access many of the .cpu files, which is how the MDS server gets pummeled.
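To make the scale of the problem concrete, here's a back-of-the-envelope sketch (the task and file counts below are hypothetical, just to illustrate the access pattern):

```python
# Hypothetical numbers: with no spatial restriction on the .cpu files,
# each parallel HOP task may have to open most of them, so the MDS sees
# on the order of tasks * files metadata operations at job startup.
n_tasks = 256        # parallel HOP tasks in one job
n_cpu_files = 1024   # .cpu files in the dataset
metadata_ops = n_tasks * n_cpu_files
print(metadata_ops)  # 262144 opens hammering the metadata server
```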
_______________________________________________________
sskory(a)physics.ucsd.edu o__ Stephen Skory
http://physics.ucsd.edu/~sskory/ _.>/ _Graduate Student
________________________________(_)_\(_)_______________
----- Forwarded Message ----
> From: Tommy Minyard <minyard(a)tacc.utexas.edu>
> To: "sskory(a)physics.ucsd.edu" <sskory(a)physics.ucsd.edu>
> Sent: Thursday, April 30, 2009 12:54:49 PM
> Subject: X1024 jobs on Ranger
>
> Hello Stephen,
>
> In our monitoring of Ranger, we've noticed that some of your recent jobs named
> x1024 seem to be causing an abnormally high load on the /scratch filesystem
> meta-data server (MDS). From our monitoring, it appears that the MDS load goes
> way up when your job initially begins to run for up to the first 30 minutes to
> hour, but then it drops back down to a more reasonable load after the job has
> been running for a while.
>
> Do you have any idea what may be triggering such a high load from the
> application you are running? It does not seem to cause any major problems or
> generate errors, however, the filesystem access becomes much more sluggish when
> the MDS load is so high. If you could give us a few more details that might
> help explain the high load or point us to the source code for your application,
> we want to check and confirm that the MDS is acting as it should.
>
> Thanks,
> Tommy
>
> ____________________________________________________________________
> Tommy Minyard, Ph.D. - Assoc. Director (512) 232-6578
> Advanced Computing Systems Group (512) 475-9445 (fax)
> Texas Advanced Computing Center http://www.tacc.utexas.edu
> The University of Texas at Austin minyard(a)tacc.utexas.edu
(I'd send this to -users, but since we're not emphasizing parallel hop just yet, I'll put it here.)
I am trying to run hop on Ranger on a 1024^3 AMR dataset I got from Robert Harkness. I've been running it with a varying number of threads, always greater than 64. I'm pretty sure it's not a memory problem, I ssh-ed into the head node and ran 'top.' I think I'll try this on Kraken tomorrow, but for now, can any of you make heads or tails of the error message below:
AttributeError: 'list' object has no attribute 'append'
When can a list not have 'append' as an attribute?
http://paste.enzotools.org/show/112/
I can successfully run parallel hop on L7 with this exact same install of yt.
The script I'm running is dead simple:
from yt.mods import *
pf = load("DD0082")
hop = HaloFinder(pf,padding=0.02)
hop.write_out(filename="benchmark-hop.out")
Britton, I know you've been doing some large-scale stuff lately. Have you run hop on something this large?
Thanks!
Hi all,
With the new setup, I'm getting negative values from center_of_mass() for some of the haloes in L7 when running parallel HOP. It's a small fraction of the total. I am not getting this effect on smaller datasets, for whatever reason.
Because L7 takes a while to run through HOP, I haven't been able to do much testing yet. I've only run this on Ranger. I'm going to try on Kraken soon, but I suspect I'll get the same thing. I also suspect I'd get the same problem with parallel FOF. I haven't tested this on a serial run on L7, because it takes a long time. I'd guess that it would work OK. I'll see if I can get that going, too.
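For what it's worth, here's a toy NumPy sketch (not yt code) of one way a halo straddling the periodic edge of a unit box can produce a nonsensical center of mass, along with a minimum-image fix:

```python
import numpy as np

# Toy example: x-coordinates of a halo straddling the periodic
# boundary of a unit box.
pos = np.array([0.01, 0.02, 0.98, 0.99])

# Naively averaging raw positions puts the center at 0.5, nowhere
# near the actual particle cluster at the boundary.
naive_com = pos.mean()

# Minimum-image approach: measure every particle relative to the
# first one, wrap displacements into [-0.5, 0.5), average, then map
# the result back into [0, 1).
d = pos - pos[0]
d -= np.round(d)
com = (pos[0] + d.mean()) % 1.0  # lands essentially on the boundary
```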
Matt, can you think of anything you changed when you did the HaloFinding switchover that would affect this?
I'll see what I can figure out, but I just wanted to see if anyone (Matt in particular) had some ideas of where this is coming from before I jump headlong into this.
Thanks!
Hi all,
I've been trying to get the new HOP/FoF implementation to run on L7 and on Kraken. I have found a couple problems.
The easy one: when trying to statically link both EnzoHop.a and EnzoFOF.a, the linker complains because they share the same kdtree function names. FoF's function names that conflict with HOP's should be renamed. I'll get around to changing this sometime; for now, I doubt anyone is using it. Alternatively, if the two kdtrees are identical enough, it may be better to use a single set of kdtree code for both. Thoughts?
Second, halo_list.write_out() is obtaining the particle_velocity_* fields serially, because those fields are no longer read in in parallel before calling the halo finder. What is the best way to read that data in parallel too, but only when it's needed?
Hi guys,
I'd like to take a moment to sketch out my proposal for releasing
yt-1.5. We all run off trunk, but right now, 1.0 is the version
included with Enzo. Furthermore, the documentation has not been
substantially updated since 1.0, so almost all of the stuff we have
been doing and working on for the past year (yes, *year*) has not been
advertised or included. I'd like to compile some of the things that
have happened, most of which have been visible in the ticket tracker,
but not all.
http://yt.enzotools.org/query?status=closed&group=resolution&milestone=1.5
There are a couple things that don't show up quite so visibly:
* Fully parallel slices, projections, cutting planes, profiles,
quantities (although scaling will be a target of 2.0)
* Mostly parallel HOP (scaling is a major problem here)
* FoF halo finding
* Object storage and serialization
* Major performance improvements to the clump finder (factor ~5)
* Generalized domain sizes
* Generalized field info containers (which need some work for 2.0)
* Dark Matter-only simulations
* 1D and 2D simulations
* Better IO for HDF5 sets
* Support for the Orion AMR code
* Spherical re-gridding
* Halo profiler (ticket 95 needs some love)
* Light cone generator
* Callback interface improved
* Several new callbacks
* New data objects -- ortho and non-ortho rays, limited ray-tracing
* Fixed resolution buffers
* Spectral integrator for CLOUDY data
* Substantially better interactive interface
* Performance improvements basically everywhere
* Command-line interface to *many* common tasks
* Isolated plot handling, independent of PlotCollections
I'm pretty sure there's a lot more I'm forgetting -- but the
improvements have been numerous. I think the biggest thing to
emphasize here would be that yt works in parallel. Scaling is not
ideal. That's going to be a task for a later date, I do believe, and
for now I think it's important to push out the improvements --
frankly, I'm a little embarrassed people may be using yt-1.0 at this
point, because so much work has gone on in the trunk to improve
basically every aspect of the code. (And yes, our audience is small,
by almost all metrics: but I think this particular set of improvements
will help to increase it.) I think this release is much more
compelling than the last, and I think we all have something to be
proud of.
We still have to finish up these tickets:
http://yt.enzotools.org/report/12
Most or all of these are my responsibility; I'd like some help with
#95, though. Additionally, if anybody wants to volunteer to look at
(even if just attempting to replicate and getting proper tracebacks)
#189, #191 and #197, I'd really appreciate it. #208 might get bumped
to 2.0, since I'm the only one using fido anymore; I'd like to bring
it back in some way some day, but right now it's not a priority.
Now, for the meat of the remaining process: the documentation. This
will be primarily my responsibility, but I will need help from other
people. Specifically, I need someone to volunteer to read and edit
what I write. Additionally, I would like to submit a request that
sections of prose be written, in the style of a tutorial, by the
people who really are the experts on different sections of the code.
I've noticed a substantial uptick in feelings of ownership of various
sections, and I'm really happy about that. I think we really need
some semi-tutorial style sections written by the people who feel they
know best certain sections of the code that are really valuable and
useful. Here's my proposed set:
* Halo finding: Stephen
* Halo profiler: Britton
* Clump finder: Britton
* Light Cone Generator: Britton
* Interactive Usage: Jeff
Now, I know you are all very busy -- me too! -- so if this doesn't
work for you, it's all good, just let me know. :) I'll handle all
the format conversions and integration and so forth. The current
documentation repo (it has not been put in SVN, but will be once it's
working better) is here: http://hg.enzotools.org/yt-doc/ .
Furthermore, there's a page set up for comments on the existing documentation:
http://yt.enzotools.org/wiki/Plans/DocEnhancements
Please, if you have any complaints about the existing documentation,
record them there; that will ensure I take them into account while I
go through the process of editing and rewriting.
Additionally, while I am willing to take the burden of documentation,
I would really love it if somebody else would help out. So, let me
know if you are interested.
Okay guys, thanks so much for trudging through this email. Any
thoughts on any of this? Did I forget any big features? Anybody feel
like helping with extra documentation? Anything else that needs to go
in? We've all put a lot of work into this stuff, and I think we have
a great chance of making a big splash with a new release.
Thanks guys, and talk to you soon,
-Matt
Hi guys,
I've been exploring the idea of changing radius to be correct for
periodic boxes. Right now it is incorrect; each component (x,y,z)
should not have a distance greater than 0.5 * domain_size. The
easiest way to do this would be:
dx = abs( x - center_x )
rx = min( dx, domain_x - dx )
I think the best way to do this is to set up a NumPy ufunc
http://docs.scipy.org/doc/numpy/user/c-info.beyond-basics.html#creating-a-n…
that accepts three arrays, along with the center and the domain width,
and then returns the radius. What do you all think? Alternatively,
I'm thinking maybe just a regular function that gets the arrays and
returns one would be better; the ufunc machinery is a bit complicated
and I might get confused. Once I come up with it, will somebody be
able to look over my work?
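To be concrete, a plain-function version might look something like this (a sketch only; the function and argument names here are mine, not yt's):

```python
import numpy as np

def periodic_radius(x, y, z, center, domain_width):
    """Distance from `center` under the minimum-image convention:
    no single component ever exceeds 0.5 * domain_width."""
    r2 = np.zeros_like(x, dtype=np.float64)
    for pos, c, w in zip((x, y, z), center, domain_width):
        d = np.abs(pos - c)
        d = np.minimum(d, w - d)  # take the nearer periodic image
        r2 += d * d
    return np.sqrt(r2)
```

With a unit box, a point at x = 0.9 measured from a center at x = 0.0 comes out at radius 0.1, not 0.9.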
-Matt
Hi guys,
Britton and I worked on some improvements to hop today. Specifically,
changing the means of accessing attributes from *copying* to
*in-place* access. This should cut down on the memory substantially,
but I believe there are still places that it could be improved.
Specifically, I think a consolidation of the tags & iHop attributes
could improve things.
I've placed the patch at paste #97, and you additionally need:
http://yt.enzotools.org/files/hop_numpy.h
If y'all get a chance to look at it, see if it cuts down on memory
usage at all, that'd be awesome. We're currently running a bunch of
tests. If it works out, we're moving to this over the old method.
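As a rough illustration of the copy-versus-in-place distinction (pure NumPy, not the patch itself):

```python
import numpy as np

# A large particle array like the ones HOP consumes.
positions = np.random.random((1_000_000, 3))

copied = positions[:, 0].copy()  # allocates a fresh ~8 MB buffer
view = positions[:, 0]           # a view: no new allocation at all

# The view shares memory with the original array, so accessing
# attributes this way adds essentially no memory overhead.
assert view.base is positions
assert copied.base is None
```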
-Matt
Hi everyone,
Do any of you currently use, or have you in the past used, the
'source' keyword as supplied when creating a slice?
I have successfully finished the parallelization of both cutting
planes and slices, but to do the slices I had to (or chose to, really)
move from a 2d domain decomp to a standard grid iterator; this means I
got rid of source=. For 2.0, all the data objects will implement the
correct protocol, so source will come back, but for 1.5 I'd like to
put out a working parallel slice & cutting plane pair. But, I don't
want to cut off anybody's legs.
Thanks!
-Matt