Folks,

I would like to test the integrity of a large dataset, for example by finding the maximum value of some data field. The dataset is too large to be fully loaded into memory, so I would like to read it slice by slice, compute the maximum in each slice, and then combine the results. To that end I wrote the following script:

import sys
import numpy as np
import yt
import yt.funcs

f = sys.argv[1]
ds = yt.load(f)
n = ds.domain_dimensions[0]

dmax = np.zeros((n,))
for i in range(n):
    d = ds.r[i:(i + 1), :, :][("artio", "HVAR_GAS_DENSITY")]
    dmax[i] = np.amax(d)
    print(i, dmax[i], yt.funcs.get_memory_usage())
    del d

print(np.amax(dmax))

However, when I run it, I get:

yt : [INFO ] 2018-08-28 10:52:58,075 Created 4945 chunks for ARTIO
0 16945762.0 202.5546875
yt : [INFO ] 2018-08-28 10:52:59,388 Created 1232 chunks for ARTIO
1 2416576.25 221.36328125
yt : [INFO ] 2018-08-28 10:53:01,599 Created 9635 chunks for ARTIO
2 10419311.0 269.06640625
...
yt : [INFO ] 2018-08-28 10:53:23,397 Created 1474 chunks for ARTIO
13 2395590.5 698.91015625
yt : [INFO ] 2018-08-28 10:53:25,594 Created 9747 chunks for ARTIO
14 11139424.0 739.16015625

The number of chunks created varies from iteration to iteration, but the total memory usage keeps climbing steadily.

Could you advise what I am doing wrong?

Many thanks,

Nick
Hi Nick,

Sorry this took me so long to reply to. I don't know *precisely* what is going on, but I can tell you some caveats of how the ARTIO frontend works that may be relevant. As far as I recall, a few things come into play. I will also note that Doug Rudd was the original author of most of this, so if I paraphrase or misstate something, it is not intentional.

* The ARTIO frontend is *very* neutral with respect to which data exists where. As you probably know, when it computes the SFC values that intersect with a selector, it tries to optimize the ordering and collection of those values based on what it knows about their distribution across files.

* The ARTIO frontend manages much of the data reading and selection *internally*, inside _artio_reader. In practice this means there may be memory allocations (we have tried to be very judicious in our deallocations) that persist as we traverse. If I remember correctly, once a file that holds an SFC range has been identified, the reader allocates enough memory (internal to the artio reader) to store the sub-chunks that live there, and you can then iterate over the oct values internal to each SFC value. I believe these allocations should be freed when the SFC is released, but it is possible there is a leak there.

* Another possibility, which may not pan out, is that the reported memory usage simply precedes Python's internal garbage collection. You may see different results if you import gc at the top and call gc.collect() manually at the end of each loop iteration.

One additional thing I would check, which may result in more careful internal-to-yt allocation, is how this proceeds:

with yt.memory_checker(5):
    dd = ds.all_data()
    dmax = dd.max(("artio", "HVAR_GAS_DENSITY"))

This sets up a memory checker that reports at 5-second intervals and shows the result of the dd.max() call. dd.max() should do an IO-aware iteration, which *should* proceed in such a way that no more than one file is open at a time -- it should order the SFCs to minimize this.

I hope this helps.

-Matt
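For reference, here is a minimal sketch of the gc-based variant Matt suggests above: Nick's original loop, unchanged except for importing gc and calling gc.collect() at the end of each iteration, so that the reported memory usage reflects what Python can actually free.

import gc
import sys

import numpy as np
import yt
import yt.funcs

f = sys.argv[1]
ds = yt.load(f)
n = ds.domain_dimensions[0]

dmax = np.zeros((n,))
for i in range(n):
    # Read one slab of the domain and take the maximum of the density field.
    d = ds.r[i:(i + 1), :, :][("artio", "HVAR_GAS_DENSITY")]
    dmax[i] = np.amax(d)
    del d
    # Force a collection pass so any reference cycles still holding chunk
    # data are released before the next slab is read.
    gc.collect()
    print(i, dmax[i], yt.funcs.get_memory_usage())

print(np.amax(dmax))

If memory usage still grows with gc.collect() in place, that would point back at allocations internal to the artio reader rather than at Python's garbage collector.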