Filtering XArray Datasets?
Peter Otten
__peter__ at web.de
Tue Jun 7 09:42:31 EDT 2022
On 07/06/2022 00:28, Israel Brewster wrote:
> I have some large (>100GB) datasets loaded into memory in a two-dimensional (X and Y), NumPy-array-backed XArray dataset. At one point I want to filter the data using a boolean array created by performing a boolean operation on the dataset; that is, I want to filter the dataset for all points with a longitude value greater than, say, 50 and less than 60, just to give an example (hopefully that all makes sense?).
>
> Currently I am doing this by creating a boolean array (data['latitude'] > 50, for example), and then applying that boolean array to the dataset using .where(), with drop=True. This appears to work, but has two issues:
>
> 1) It’s slow. On my large datasets, applying where can take several minutes (vs. just seconds to use a boolean array to index a similarly sized numpy array)
> 2) It uses large amounts of memory (which is REALLY a problem when the array is already using 100GB+)
>
> What it looks like is that values corresponding to True in the boolean array are copied to a new XArray object, thereby potentially doubling memory usage until it is complete, at which point the original object can be dropped, thereby freeing the memory.
>
> Is there any solution for these issues? Some way to do an in-place filtering?
Can XArrays be sorted and resized in place? If so, you can sort by
longitude <= 50, search for the index of the first row with
longitude <= 50, and then resize the array to end at that index.
(If the order of rows matters, the sort algorithm has to be stable.)
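
To illustrate the idea on a plain NumPy array (not an XArray object; I don't
know whether XArray offers an equivalent), here is a minimal sketch. It
assumes a C-contiguous array that owns its data, and it swaps the full sort
for a chunk-wise stable partition: kept rows are moved to the front in their
original order, so only one chunk-sized temporary exists at a time. The
function name compact_rows_inplace and the chunk parameter are made up for
this example.

```python
import numpy as np


def compact_rows_inplace(table, keep, chunk=1_000_000):
    """Move rows where `keep` is True to the front of `table`, then shrink it.

    Row order is preserved (equivalent to a stable sort on ~keep followed by
    truncation). Returns the number of rows kept.
    """
    write = 0
    for start in range(0, len(table), chunk):
        stop = start + chunk
        # Boolean indexing copies only the kept rows of this chunk.
        block = table[start:stop][keep[start:stop]]
        # write <= start, so this only overwrites rows already processed.
        table[write:write + len(block)] = block
        write += len(block)
    # In-place resize; refcheck=False because the caller still holds a
    # reference. Any other views of `table` become invalid after this.
    table.resize((write,) + table.shape[1:], refcheck=False)
    return write


# Toy data: column 0 plays the role of longitude.
table = np.array([[45.0, 1.0], [55.0, 2.0], [62.0, 3.0],
                  [51.0, 4.0], [58.0, 5.0], [50.0, 6.0]])
keep = (table[:, 0] > 50) & (table[:, 0] < 60)
n = compact_rows_inplace(table, keep, chunk=2)
print(n, table.shape)
```

The boolean mask itself still costs one byte per row, and whether the resize
actually hands memory back to the operating system depends on the allocator.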