Folks,

I would like to test the integrity of a large dataset, for example by finding the max value of some data field. The dataset is too large to be fully loaded into memory, so I would like to read it slice by slice, compute the max in each slice, and then combine the results. To that end I wrote the following script:

    import yt, yt.funcs
    import sys
    import numpy as np

    f = sys.argv[1]
    ds = yt.load(f)
    n = ds.domain_dimensions[0]
    dmax = np.zeros((n,))
    for i in range(n):
        d = ds.r[i:(i+1), :, :][('artio', 'HVAR_GAS_DENSITY')]
        dmax[i] = np.amax(d)
        print(i, dmax[i], yt.funcs.get_memory_usage())
        del d
    print(np.amax(dmax))

However, when I run it, I get:

    yt : [INFO ] 2018-08-28 10:52:58,075 Created 4945 chunks for ARTIO
    0 16945762.0 202.5546875
    yt : [INFO ] 2018-08-28 10:52:59,388 Created 1232 chunks for ARTIO
    1 2416576.25 221.36328125
    yt : [INFO ] 2018-08-28 10:53:01,599 Created 9635 chunks for ARTIO
    2 10419311.0 269.06640625
    ...
    yt : [INFO ] 2018-08-28 10:53:23,397 Created 1474 chunks for ARTIO
    13 2395590.5 698.91015625
    yt : [INFO ] 2018-08-28 10:53:25,594 Created 9747 chunks for ARTIO
    14 11139424.0 739.16015625

The number of chunks created varies in each iteration of the loop, but the total memory usage still climbs steadily. Could you advise what I am doing wrong?

Many thanks,
Nick
Hi Nick,
Sorry this took me so long to reply to.
I don't know *precisely* what is going on. However, I can tell you
about some caveats of how the ARTIO frontend works that may be
relevant. As best I recall, there are a few things that come into
play -- I will also note that Doug Rudd was the original author of
most of this code, so if I paraphrase or misstate something, it is
not intentional.
* The ARTIO frontend is *very* neutral with respect to where data
lives. As you probably know, it computes the SFC (space-filling
curve) values that intersect with a selector, and it tries to
optimize the ordering and collection of these based on what it knows
about their distribution across files.
* The ARTIO frontend *internally* manages much of the data reading
and selection, inside _artio_reader. In practice this means there
may be memory allocations (we have attempted to be very judicious in
our deallocations) that persist as we traverse the data. If I am
remembering correctly, once a file that holds an SFC range has been
identified, it allocates enough memory (internal to the artio
reader) to store the sub-chunks that live there; you can then
iterate within each SFC value over the oct values internal to it. I
believe these should be freed when the SFC is released, but it is
possible that there is a leak there.
* Another possibility, which may not pan out, is that the reported
memory usage simply precedes Python's internal garbage collection.
You may see different results if you import gc at the top and
manually call gc.collect() at the end of each loop iteration.
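Stripped of the yt calls, the pattern I have in mind looks like this
(process_slice here is a hypothetical stand-in for your per-slice
read, not a yt API):

    import gc

    def process_slice(i):
        # Hypothetical stand-in for the per-slice yt read:
        # allocate a large temporary and reduce it to its max.
        data = [float(i)] * 100_000
        return max(data)

    slice_maxes = []
    for i in range(3):
        slice_maxes.append(process_slice(i))
        # Force a collection so any cyclic garbage from this
        # iteration is freed before the next slice is read.
        gc.collect()

    overall_max = max(slice_maxes)
    print(overall_max)  # prints 2.0

If the memory growth in your script shrinks or disappears with the
explicit gc.collect(), that points at delayed collection rather than
a leak in the reader.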
One additional thing I would check, which may result in more careful
internal-to-yt allocation, is how this proceeds:
    with yt.memory_checker(5):
        dd = ds.all_data()
        dmax = dd.max(("artio", "HVAR_GAS_DENSITY"))
This sets up a memory checker that reports usage at 5 second
intervals while the dd.max() call runs. dd.max() should do an
IO-aware iteration, which *should* proceed in such a way that no
more than one file is open at a time -- it should order the SFCs to
minimize file openings.
I hope this helps.
-Matt
On Tue, Aug 28, 2018 at 11:02 AM Nick Gnedin wrote:
participants (2): Matthew Turk, Nick Gnedin