A couple people have mentioned to me recently (in IRC and over email)
that the stable branch is missing some fixes for FLASH and Enzo data.
I'd like to start a discussion on wrapping up the dev cycle and moving
for a 2.4 release. I believe the showpiece of this release will be
improvements to volume rendering; more on that below.
Here are the open, targeted bugs for 2.4:
These touch on a few major areas:
* Minor improvements to reason and reason widgets: Adding vectors,
layers (which should get postponed), a volume rendering widget (should
get postponed) and a file open dialog (Nathan had been working a bit
* FLASH updates to make it more user friendly: John Z and I had a
brainstorming session after I spoke to some FLASH users last week, and
I filled out tickets to try to address any remaining weak points.
* Time Series updates: Nathan has made a few suggestions, and I'm
going to implement them. I like where Time Series is going!
* Volume rendering refactor (Sam, if you have a few minutes, could
you give an update on what that includes and what it provides? This
could be a great centerpiece of 2.4!)
* Rockstar integration is incomplete and undocumented, but that may
also be postponed.
There are also a few bugs that need to be addressed:
* MPI attributes now break pickling in some cases; I will address this
* Ghost zone issues seem to be everywhere, as corner cases keep
coming up. John W and I have spoken about this and we have an idea
for how to fix it moving forward.
* Quadproj scales poorly, but that'll be a bit of a tricky piece of work.
Does anyone have any comments or ideas about these bugs, or want to
tackle any of them? Are there any other concerns, or thoughts on
How would aiming to release in a month from now sound? April 16 is a
Monday, which would be a good day for a release.
In general, I agree with the idea Nathan put out. (Also, I think this
is a fine time to have a bikeshed discussion. Many of the underlying
assumptions about how yt works were laid out a long time ago.) But,
I'm not entirely sure I understand how different it would be --
conceptually, yes, I see what you're getting at, that we'd have a set
number of attributes. In what I've been thinking of for the geometry
refactor so far, I'm trying to get rid of the "hierarchy" as existing
for every data set, and instead rely on what amounts to an
object-finder and io-coordinator, which I'm calling a geometry
handler. It sounds like what you would like is:
1) Get rid of accessing parameters with an implicit __getitem__ on the
parameter file (i.e., pf["SomethingThatOnlyExistsInOneCode"]). I'm
+10 on this.
2) Move units into the .units object (I'm mostly with Casey on this,
but I think it should be a part of the field_info object)
3) Have things like current_time, domain_dimensions and so on move
into basic_info and make them dict objects.
I think of those, I'm in favor of one and two, but somewhat opposed to
#3. Right now we have these attributes mandated for subclasses of
The only ones here that I think would be okay to move out of
properties would be the cosmology items, and even those I'm -0 on
But, in general, the idea of moving from this two-stage system of
parameter file (rather than dataset) and hierarchy (rather than an
implicitly-handled geometry) is something I am in support of. The
geometry is something that should nearly *always* be handled by the
backend, rather than by the user. So having the library require
pf.h.sphere(...) is less than ideal, since it's exposing something
relatively unfortunate (that building a hundred thousand grid objects
can take some time).
The main ways that the static output is interacted with:
* Parameter information specific to a simulation code
* Properties that yt needs to know about
* To get at the hierarchy
* Input to plot collections
The main ways that the hierarchy is interacted with:
* Getting data objects
* Finding max
* Statistics about the simulation
* Inspecting individual grids (much less common use case now than it was before)
All of these use cases are still valid, but I think it's clear that
accessing individual grids and accessing simulation-specific
parameters are not "generic" functions. What a lot of this discussion
has really brought up for me is that we're talking about *generic*
functionality, not code-specific functionality, and we right now do
not have the best enumeration of functionality and where it lies.
With the geometry_refactor, I'd like to consolidate functionality into
the main "dataset" object. The geometry can still provide access to
the individual grids (of course) but data objects, finding max,
getting stats about the simulation, etc, should all go into the main
dataset object, and the geometry handler can simply be created on the
fly if necessary.
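As a rough sketch of what I mean by creating the geometry handler on the fly (every name here is hypothetical and made up for illustration; none of this is the current yt API), the dataset could hold off on building it until something actually asks for it:

```python
class GeometryHandler:
    """Hypothetical object-finder / IO coordinator (not an existing yt class)."""

    def __init__(self, dataset):
        # In a real handler this is where the expensive grid setup would go.
        self.dataset = dataset

    def select_sphere(self, center, radius):
        # Stand-in for returning a real data object.
        return {"shape": "sphere", "center": center, "radius": radius}


class Dataset:
    """Hypothetical dataset object that owns its geometry handler."""

    def __init__(self, filename):
        self.filename = filename
        self._geometry = None      # nothing expensive happens at load time
        self.geometry_builds = 0   # only here so the laziness is observable

    @property
    def geometry(self):
        # Build the handler the first time it is needed, then reuse it.
        if self._geometry is None:
            self._geometry = GeometryHandler(self)
            self.geometry_builds += 1
        return self._geometry

    def sphere(self, center, radius):
        # The user never touches the geometry handler directly.
        return self.geometry.select_sphere(center, radius)
```

With something like this, ds.sphere(...) works without the user ever seeing the handler, and a script that only reads parameters never pays for building it.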
This brings up two points, though --
1) Does our method of instantiating objects still hold up? i.e.,
ds.sphere(...) and so on? Or does our dataset object then become
overcrowded? I would also like to move *all* plotting objects into
whatever we end up deciding is the location data containers come from,
which for instance could look like ds.plot("slice", "x") (for
instance, although we can bikeshed that later), which would return a
2) Datasets and time series should behave, if not identically, at
least consistently in their APIs. Moving to a completely ds-mediated
mechanism for generating, accessing and inspecting data opens up the
ability to then construct very nice and simple proxy objects. As an
example, while something like this is technically possible with
the current Time Series API, it's a bit tricky:
ts = TimeSeriesData.from_filenames(...)
plot = ts.plot("slice", "x", (100.0, 'au'))
ts.seek(dt = (100, 'years'))
ts.seek(dt = (10, 'years'))
(The time-slider, as Tom likes to call it ...)
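To make the proxy idea a little more concrete, here is a minimal sketch, assuming every dataset exposes the same methods. DatasetSeriesProxy and ToyDataset are hypothetical names invented for this example and are not part of the Time Series API:

```python
class DatasetSeriesProxy:
    """Hypothetical proxy: forwards a ds-level method call to every dataset."""

    def __init__(self, datasets):
        self.datasets = list(datasets)

    def __getattr__(self, name):
        # Only reached for attributes the proxy itself does not define,
        # so any dataset method name is forwarded to each member in order.
        def call_on_each(*args, **kwargs):
            return [getattr(ds, name)(*args, **kwargs) for ds in self.datasets]
        return call_on_each


class ToyDataset:
    """Stand-in for a real dataset, just enough to exercise the proxy."""

    def __init__(self, current_time):
        self.current_time = current_time

    def plot(self, kind, axis):
        return (kind, axis, self.current_time)
```

Calling ts.plot("slice", "x") on a proxy built from two ToyDataset objects returns one result per dataset, which is the consistency between datasets and time series I'm after.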
In general, this idea of moving toward more thoughtful
dataset-construction, rather than the hokey parameter file + hierarchy
construction brings with it a mindset shift which I'd like to spread
to the time series, which can continue to be a focus.
What do you think?
On Thu, Mar 29, 2012 at 7:08 PM, Casey W. Stark <caseywstark(a)gmail.com> wrote:
> +1 on datasets, although I would like to see the unit object(s) at the field
> On Thu, Mar 29, 2012 at 4:04 PM, Cameron Hummels
> <chummels(a)astro.columbia.edu> wrote:
>> +1 on datasets.
>> On 3/29/12 6:58 PM, Nathan Goldbaum wrote:
>>> +1. I'd also be up to help out with the sprint. Doing a virtual sprint
>>> using a google hangout might help mitigate some of the distance problems.
>>> While we're bringing up Enzo-isms that we should get rid of, I think it
>>> might be a good idea to make a conceptual shift in the basic python UI.
>>> Instead of referring to the interface between the user and the data as a
>>> parameter file, I think instead we should be talking about datasets. One
>>> would instantiate a dataset just like we do now with parameter files:
>>> ds = load(filename)
>>> A dataset would also have some universal attributes which would present
>>> themselves to the user as a dict, e.g. ds.units, ds.parameters,
>>> ds.basic_info (like current_time, timestep, filename, and simulation code),
>>> and ds.hierarchy (not sure how that would interfere with the geometry
>>> This may be a painting-the-bike-shed discussion, but I think this shift
>>> will help new users understand how to access their data. Thoughts?
>>> On Mar 29, 2012, at 3:40 PM, Matthew Turk<matthewturk(a)gmail.com> wrote:
>>>> Hi Nathan and Casey,
>>>> I agree with what both of you have said. The Orion/Nyx units should
>>>> be made to be consistent, but more importantly I think we should
>>>> continue breaking away from Enzo-isms in the code.
>>>> As it stands, all of the universal fields call underlying Enzo-named
>>>> aliases -- Density, ThermalEnergy, etc etc. I hope we can have a 3.0
>>>> out within a calendar year, hopefully by the end of this year. (I've
>>>> been pushing on the geometry refactor, although recently other efforts
>>>> have been paying off which has decreased my output there.) I am much,
>>>> much less doubtful than Casey is that we can do this; in fact, I'm
>>>> completely in favor of this and I think it would be relatively
>>>> straightforward to implement.
>>>> In the existing system we have a mechanism for aliasing fields. What
>>>> we can do is provide an additional translation system where we
>>>> enumerate the fields that are available for items in UniversalFields,
>>>> and then construct aliases to those. This would mean changing what is
>>>> aliased in existing non-Enzo frontends, and adding aliases in Enzo.
>>>> The style of name Casey proposes is what I would also agree with:
>>>> underscores, lower cases, and erring on the side of verbosity. The
>>>> fields off hand that we would need to do this for (in their current
>>>> x-velocity => velocity_x (same for y, z)
>>>> Density => density
>>>> TotalEnergy => ?
>>>> GasEnergy => thermal_energy_specific (and thermal_energy_density)
>>>> Temperature => temperature
>>>> and so on.
>>>> Once we have these aliases in place, an overall cleanup of
>>>> UniversalFields should take place. One place we should clean up is
>>>> ensuring that there are no conditionals; rather than conditionals
>>>> inside the functions, we should place those conditionals inside the
>>>> parameter file types. So for instance, if you have a field that is
>>>> calculated differently depending on the parameter HydroMethod (in Enzo
>>>> for instance) you simply set a validator on the field requiring the
>>>> parameter be set to a particular value, and then only the field which
>>>> satisfies that validator will be called when requested.
>>>> So we've gotten rid of a bunch of enzo-isms in the parameter files;
>>>> after fields, what else can we address? And, I'd be up for sprinting
>>>> on this (which should take just a few hours) basically any time next
>>>> week or after. I'd also be up for talking more about geometry
>>>> refactoring, if anyone is interested, but it's not quite to the point
>>>> that I think I am satisfied enough with the architecture to request
>>>> input / contributions. Sometimes (especially with big architectural
>>>> things like this) I think it's a shame we do all of our work
>>>> virtually, as I think a lot of this would be easier to bang out in
>>>> person for a couple hours.
>>>> On Wed, Mar 28, 2012 at 6:14 PM, Casey W. Stark<caseywstark(a)gmail.com>
>>>>> Hi Nathan.
>>>>> I'm also worried about this and I agree that fields with the same name
>>>>> should all be consistent. I would support some sort of cleanup of
>>>>> fields, and I can get the Nyx fields in line and help with Enzo.
>>>>> I doubt we can do this, but I would prefer changing the field names as
>>>>> of the removing enzo-isms and geometry handling refactoring pushes. For
>>>>> instance, the field in Orion could be thermal_energy_density and the
>>>>> in Enzo could be specific_thermal_energy. I also noticed this issue
>>>>> when I
>>>>> was using "Density" in Enzo (proper density in cgs) and "density" in
>>>>> (comoving density in cgs).
>>>>> On Wed, Mar 28, 2012 at 1:47 PM, Nathan Goldbaum<goldbaum(a)ucolick.org>
>>>>>> Hi all,
>>>>>> On IRC today we noticed that Orion defines its ThermalEnergy field per
>>>>>> unit volume but Enzo and FLASH define ThermalEnergy per unit mass. Is
>>>>>> a problem? Since yt defaults to the Enzo field names, should we try
>>>>>> to make
>>>>>> sure that all fields are defined using the same units as in Enzo? Is
>>>>>> a convention for how different codes should define derived fields that
>>>>>> aliased to Enzo fields?
>>>>>> One problem for this particular example is that the Pressure field is
>>>>>> defined in terms of ThermalEnergy in universal_fields.py so the units
>>>>>> ThermalEnergy become important if a user merely wants the gas pressure
>>>>>> the simulation.
>>>>>> One possible solution for this issue would be the units overhaul we're
>>>>>> planning. If all fields are associated with a unit object, we can
>>>>>> query the units to ensure that units are taken care of correctly and
>>>>>> code-to-code comparisons aren't sensitive to the units chosen for
>>>>>> fields in
>>>>>> the frontend.
>>>>>> Personally, I think it would be best if we could make sure that all of
>>>>>> fields aliased to Enzo fields have the same units.
>>>>>> Nathan Goldbaum
>>>>>> Graduate Student
>>>>>> Astronomy& Astrophysics, UCSC
>>>>>> yt-dev mailing list
Hello yt developers,
I don't know if this is a "bug," exactly, but I noticed an issue when doing
a clean install of yt today involving the h5py install. Namely, it looks
like if you have a non-empty CFLAGS environment variable (which can happen
without your knowledge if you load certain module files in a supercomputing
environment, for example), the h5py build will proceed without getting the
"-fno-strict-aliasing" flag and the resulting module will not work. You can
get around this either by 1) clobbering CFLAGS, or 2) adding
"-fno-strict-aliasing" to it and re-running the script.
It seems like the install script should either detect this and work around
it, or else warn you that the h5py build has gone off the rails after the
installation, because the tracebacks you get from trying to use the
whacked-out h5py module are not very illuminating. I would do this myself,
but I'm not sure my shell scripting skills are up to the task.
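For what it's worth, the check itself is simple. Here is a minimal sketch in Python rather than shell (the function name is my own invention, and this is not taken from the install script), which returns a CFLAGS value that always carries the flag:

```python
import os

REQUIRED_FLAG = "-fno-strict-aliasing"


def cflags_with_required_flag(environ=None, flag=REQUIRED_FLAG):
    """Return a CFLAGS string that is guaranteed to contain `flag`.

    `environ` defaults to os.environ; passing a plain dict makes it testable.
    """
    environ = os.environ if environ is None else environ
    flags = environ.get("CFLAGS", "").split()
    if flag not in flags:
        # Append rather than clobber, so flags set by module files survive.
        flags.append(flag)
    return " ".join(flags)
```

The install script could do the equivalent before the h5py step, or simply print a warning when CFLAGS is set and the flag is absent.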
On IRC today we noticed that Orion defines its ThermalEnergy field per unit volume but Enzo and FLASH define ThermalEnergy per unit mass. Is this a problem? Since yt defaults to the Enzo field names, should we try to make sure that all fields are defined using the same units as in Enzo? Is there a convention for how different codes should define derived fields that are aliased to Enzo fields?
One problem for this particular example is that the Pressure field is defined in terms of ThermalEnergy in universal_fields.py so the units of ThermalEnergy become important if a user merely wants the gas pressure in the simulation.
One possible solution for this issue would be the units overhaul we're planning. If all fields are associated with a unit object, we can simply query the units to ensure that units are taken care of correctly and code-to-code comparisons aren't sensitive to the units chosen for fields in the frontend.
Personally, I think it would be best if we could make sure that all of the fields aliased to Enzo fields have the same units.
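To illustrate why the convention matters, here is a small sketch assuming an ideal gas with gamma = 5/3. The function names are mine, and I have not copied this from universal_fields.py; it only shows the two textbook forms of the relation:

```python
GAMMA = 5.0 / 3.0  # assumption: ideal monatomic gas


def pressure_from_specific_energy(density, specific_energy, gamma=GAMMA):
    """P = (gamma - 1) * rho * e, with e the thermal energy per unit mass."""
    return (gamma - 1.0) * density * specific_energy


def pressure_from_energy_density(energy_density, gamma=GAMMA):
    """P = (gamma - 1) * u, with u the thermal energy per unit volume."""
    return (gamma - 1.0) * energy_density
```

Both forms agree when each is given the quantity it expects, but if a per-unit-volume ThermalEnergy is passed to the per-unit-mass form, the result is off by a factor of the density.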
Astronomy & Astrophysics, UCSC
I tried using macports to install yt and ran into a couple of issues.
First, there was a conflict because I had hdf5 installed to use flash and yt required hdf5-18. Secondly, after uninstalling hdf5 to see what would happen, I received these errors:
Error: Checksum (md5) mismatch for yt-2.2.tar.gz
Error: Checksum (sha1) mismatch for yt-2.2.tar.gz
Error: Checksum (rmd160) mismatch for yt-2.2.tar.gz
Error: Target org.macports.checksum returned: Unable to verify file checksums
Error: Status 1 encountered during processing.
This cycle I applied to Amazon Research, and last week I received
notice that the grant was accepted (woo hoo!). Using the AWS credits,
I've been able to deploy an alpha version of a "data hub" on Amazon
EC2. It's nearly fully backed by S3, SimpleDB and EC2 (only user
authentication is stored on the instance, which I'd like to change
The idea behind this is to make an easy way to share *data*, not just
images, scripts, and so on. It's not designed to be robust for many
years (like a proper archiving solution would be) and it's not
designed to be fully generic, but rather it's designed ... as a
pastebin for data. You construct a widget, a representation of a data
object, and then you can shove it up there and display the data
through the widget.
Right now it contains widgets for:
* 3D Vertices, displayed with WebGL
* Variable mesh maps (i.e., the mapserver)
* Image collections
* Parameter files
There are a number of places where it's not yet finished: the vertices
and image collections haven't been wrapped into yt proper but as it
stands, yt can upload both parameter files and variable mesh (slices,
projections) really easily.
The Data Hub has been deployed here:
and some example scripts are here:
You might get errors about the certificate, which is self-signed, or
about loading non-secure content (the XTK source for the 3D models),
and you'll *probably* be able to crash the view or get a "Server
incorrectly configured" error. It's still pretty early! But I'd like
to request that you try hammering on this, try uploading data, and I
would really appreciate any help with design, coding, new widgets,
etc. The source is at http://bitbucket.org/MatthewTurk/yt.hub/ . As
you'll find, it's still a bit hacked together, but I am interested in
continuing to refine it, make it look better and nicer, and to ensure
that it's maintainable. (If you are interested in this, fork away!)
Any feedback would be *greatly* appreciated. I'm pretty excited about
using this for collaboration and data sharing! Down the road I could
see adding on more functionality like synchronized views, annotations
(halos, points of interest, etc), 3D volumes (for phase plots) and on
and on and on. In fact, this could be a way to share results with
collaborators from a running simulation.
Better than feedback, though, would be if you tried it out -- and
uploaded some data!
PS The variable mesh maps should work on phones and tablets. :)
I am trying to improve the efficiency of my analysis script, which
calculates attributes of haloes. After watching the workshop video on yt
parallelism, I was motivated to give parallel_objects a try. I am
basically trying to calculate, then output, some properties of each halo
found by parallel HOP. It turns out that even if I just output the (DM
particle) mass of each halo, I am missing some haloes. It doesn't matter if I
run this in serial or parallel; I end up missing the same number of haloes
if I use parallel_objects() like:
haloes = LoadHaloes(pf, HaloListname)
for sto, halo in parallel_objects(haloes, num_procs, storage = my_storage):
to iterate over the haloes, and the problem goes away if I just switch to:
for halo in haloes:
I noticed this when I tried it on an 800 cube dataset with around 50k
haloes: I only get 4k haloes in return. I then tried to narrow things down,
and ruled out the way I am calculating the attributes, because I can
just output the mass from halo.total_mass(), which is basically read in from
the .h5 file, and I'd still end up missing haloes using parallel_objects. For
a 128 cube dataset with 85 haloes, I'd end up missing 3 and get 82 back, and
for a 64 cube dataset with 22 haloes, I'd get back 21 haloes.
Has anyone else encountered this behavior or can confirm it?
After chatting with Britton on IRC a few days ago, I pushed some
changes that keep the SQLite I/O on the root task only. Previously
only the O was on the root task, but all tasks did the I. This change
was done to hopefully A) speed things up with fewer tasks reading off
disk and B) reduce memory usage with fopen()s and such. In my limited
testing I saw a small increase in speed on 26 data dumps (something
like 3m50s to 3m35s) excluding/precomputing the halo finding step. But
this was on a machine with a good disk and there was no chance of
running out of memory.
The point of this email is as follows. After Britton had his problems,
I re-acquainted myself with the merger tree code, and I realized there
is a bit of a problem with the way it works. In brief, in order to
reduce the amount of SQLite interaction on disk, which is slow, the
results of the merger tree (namely the halo->halo relationships) are
not written to disk until the very end. They are kept in memory up to that
point. This means that if the merger tree process is killed before the
information is saved to disk, everything is lost.
As I see it, there are a couple solutions to this.
1. When the halo relationships are saved, what actually happens is the
existing halo database is read in, and a new one is written out, and
in the process the just computed halo relationships are inserted into
the new database. This is done because SELECT (on old) and then INSERT
(on new) is orders of magnitude faster than UPDATE (on old) on databases.
I could change things such that this process is done after every new
set of halo relationships is found between pairs of data dumps. Then,
if the merger tree is killed prior to completion, not all work is
2. Add a TTL parameter to MergerTree(). When the runtime of the merger
tree approaches this number, it will stop what it's doing, and write
out what it has so far.
In both cases, restarts would just check to see what work has been
done already, and continue on from there.
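To make the two options concrete, here is a minimal sketch using the standard sqlite3 module. It is not the MergerTree code: the table layout, the link_halos name and the find_links callback are all made up for illustration, and it uses plain INSERTs into one database rather than the SELECT-then-INSERT copy described above. It commits after every pair of data dumps (option 1), stops when a TTL is reached (option 2), and skips finished pairs on restart:

```python
import sqlite3
import time


def link_halos(conn, dump_pairs, find_links, ttl_seconds=None):
    """Process pairs of data dumps, committing after each pair.

    Returns True if every pair is done, False if the TTL stopped it early.
    `find_links(parent_dump, child_dump)` is a hypothetical callback that
    yields (parent_halo, child_halo) tuples.
    """
    start = time.monotonic()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links "
        "(parent_dump TEXT, child_dump TEXT, parent_halo INTEGER, child_halo INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done_pairs "
        "(parent_dump TEXT, child_dump TEXT, PRIMARY KEY (parent_dump, child_dump))"
    )
    # On a restart, this tells us which pairs were already written out.
    done = set(conn.execute("SELECT parent_dump, child_dump FROM done_pairs"))
    for pair in dump_pairs:
        if pair in done:
            continue
        if ttl_seconds is not None and time.monotonic() - start >= ttl_seconds:
            return False  # out of time; everything so far is already on disk
        rows = [(pair[0], pair[1], p, c) for p, c in find_links(*pair)]
        with conn:  # one transaction per pair: both inserts commit together
            conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", rows)
            conn.execute("INSERT INTO done_pairs VALUES (?, ?)", pair)
    return True
```

Running it a second time with the same pairs does no new work, which is the restart behavior I have in mind for either option.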
For those of you who care, which do you think is a better solution? #1
is a bit less work for a user, but #2 is likely faster by some
Hey everyone (but specifically John Wise),
I just pulled to the newest version of yt, and I got significantly different
behavior from some old scripts of mine which use yt than I did prior to the
pull. I bisected the changesets to the one making the big difference for
me, and it was your commit: 71c2bfaa3b5f from late January.
Specifically, I'm doing some off axis projections which use the volume
rendering interface and ghost zones. I homogenize my volume, then take a
snapshot of the interior. When I do this with my old code, it takes about
20 seconds to partition and homogenize the volume, but now it takes over 10
minutes to partition the volume (I actually kill it at 10 minutes), and
it uses so much memory that all of my other programs grind to a halt on a
relatively big memory machine.
I was wondering if you (or anyone else on the list) encountered any problems
in volume rendering with ghost zones since this changeset (i.e. the last two
months), or if you can think of any obvious reasons why there would be such
a slow down? I know you're out of town right now, but I figured I'd send
this to the list anyway to see if anyone else had experienced slowdowns.
Thanks for the help!
Hi all, I was trying out pobj_demo.py from the workshop to see if I can
parallelize my halo analysis script, but I ran into the problem where I see:
>mpirun -n 2 python pobj_demo.py --parallel
gives a warning at the end
yt : [WARNING ] 2012-03-16 16:25:05,445 parallel_objects() is being used
when parallel_capable is false. The loop is not being run in parallel. This
may not be what was expected.
I've done a
yt instinfo -u and got the latest tip:
052fac826701 (yt) tip
but the problem persists. I've tried printing inside the parallel loops
for sto, sp in parallel_objects(spheres, num_procs, storage = my_storage):
print ytcfg.getint("yt", "__global_parallel_rank")
sto.result = sp.quantities['TotalQuantity']('CellMass')
sto.result_id = '%4e %4e %4e' % (sp.center,
and I always get "0", but 20 of them, so I'm guessing that confirms the
loop isn't running in parallel, and just runs the 10 spheres serially on 2
processors, so I get double the results.
Am I doing something wrong or missing something?