[Numpy-discussion] Condensing array...

Gael Varoquaux gael.varoquaux at normalesup.org
Fri Feb 25 05:22:06 EST 2011

On Fri, Feb 25, 2011 at 10:52:09AM +0100, Fred wrote:
> > What exactly do you mean by 'decimating'? To me it seems that you are
> > looking for matrix factorization or matrix completion techniques, which
> > are currently trendy topics in machine learning.
> By decimating, I mean this:

> input array data.shape = (nx, ny, nz) -> data[::ax, ::ay, ::az], i.e.
> output array data[::ax, ::ay, ::az].shape = (nx/ax, ny/ay, nz/az).
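That slicing-based decimation can be illustrated on a toy volume (the sizes and factors below are made up; with steps that divide the shape evenly, the output shape is exactly (nx/ax, ny/ay, nz/az)):

```python
import numpy as np

# Toy 3-D volume; ax, ay, az are the decimation factors from the post.
nx, ny, nz = 12, 9, 6
data = np.arange(nx * ny * nz, dtype=float).reshape(nx, ny, nz)

ax, ay, az = 3, 3, 2
decimated = data[::ax, ::ay, ::az]
print(decimated.shape)  # (4, 3, 3)
```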

OK, this can be seen as interpolation on a grid with a nearest-neighbor
interpolator. What I am unsure about is whether you want to interpolate
your NaNs, or whether they just mark missing data.

I would do this by representing the matrix as a sparse matrix in COO
format, which gives you the row and column positions of your data points.
Then I would use a nearest-neighbor structure (such as scipy's KDTree, or
scikit-learn's BallTree for even better performance,
http://scikit-learn.sourceforge.net/modules/neighbors.html) to find, for
each grid point, which data point is closest, and fill in your grid.
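A minimal sketch of that COO + nearest-neighbor idea in 2-D, using scipy's cKDTree (the matrix size, decimation factor, and random data are all illustrative, not from the original post):

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree

# Hypothetical scattered observations on a 50x50 grid.
rng = np.random.default_rng(0)
rows = rng.integers(0, 50, size=200)
cols = rng.integers(0, 50, size=200)
vals = rng.random(200)
coo = sparse.coo_matrix((vals, (rows, cols)), shape=(50, 50))

# The COO format directly exposes the (row, col) coordinates of the
# data points, which is exactly what the KD-tree needs.
points = np.column_stack([coo.row, coo.col])
tree = cKDTree(points)

# Coarse output grid (decimation factor 5 in each direction).
gx, gy = np.mgrid[0:50:5, 0:50:5]
grid = np.column_stack([gx.ravel(), gy.ravel()])

# For each grid node, find the closest data point and copy its value
# (nearest-neighbor interpolation).
_, idx = tree.query(grid)
filled = coo.data[idx].reshape(gx.shape)
print(filled.shape)  # (10, 10)
```

The same pattern works in 3-D by stacking a third coordinate column and querying a 3-D grid.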

I suspect that your problem is that you can't fit the whole matrix in
memory. If your data points are reasonably homogeneously distributed in
the matrix, I would simply process the problem in sub-matrices, making
sure to train the nearest neighbor on a sub-matrix that is larger than
the sampling grid by a margin of more than the inter-point distance.
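A sketch of that chunked approach in 2-D (the function name and the tiling parameters are illustrative; `margin` plays the role of the inter-point distance, so each tile's tree also sees the points just outside its borders):

```python
import numpy as np
from scipy.spatial import cKDTree

def fill_tile(points, values, x_range, y_range, step, margin):
    """Nearest-neighbor fill of one output tile, training the KD-tree
    only on data points within `margin` of the tile, so the whole point
    set never has to be indexed at once (a sketch, not Gael's code)."""
    x0, x1 = x_range
    y0, y1 = y_range
    # Keep only the points near this tile, with a border of `margin`.
    mask = ((points[:, 0] >= x0 - margin) & (points[:, 0] < x1 + margin)
            & (points[:, 1] >= y0 - margin) & (points[:, 1] < y1 + margin))
    tree = cKDTree(points[mask])
    # Coarse sampling grid covering just this tile.
    gx, gy = np.mgrid[x0:x1:step, y0:y1:step]
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    _, idx = tree.query(grid)
    return values[mask][idx].reshape(gx.shape)

# Illustrative use: scattered points in [0, 100)^2, one 50x50 tile.
rng = np.random.default_rng(1)
pts = rng.random((500, 2)) * 100
vals = np.arange(500.0)
tile = fill_tile(pts, vals, (0, 50), (0, 50), step=5, margin=10)
print(tile.shape)  # (10, 10)
```

Looping `fill_tile` over all tiles and stitching the results gives the full decimated grid while keeping only one sub-matrix's points in the tree at a time.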



More information about the NumPy-Discussion mailing list