On Fri, Feb 25, 2011 at 10:52:09AM +0100, Fred wrote:

> > What exactly do you mean by 'decimating'? To me it seems that you are
> > looking for matrix factorization or matrix completion techniques, which
> > are trendy topics in machine learning currently.
>
> By decimating, I mean this:
>
> input array data.shape = (nx, ny, nz) -> data[::ax, ::ay, ::az], i.e.
> output array data[::ax, ::ay, ::az].shape = (nx/ax, ny/ay, nz/az).
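For the record, a minimal sketch of that decimation step (the array sizes and strides here are made up for illustration):

```python
import numpy as np

# Toy 3-D array; ax, ay, az are the decimation strides per axis.
nx, ny, nz = 60, 40, 20
ax, ay, az = 3, 2, 2
data = np.arange(nx * ny * nz, dtype=float).reshape(nx, ny, nz)

# Keep every ax-th / ay-th / az-th sample along each axis.
decimated = data[::ax, ::ay, ::az]
print(decimated.shape)  # (20, 20, 10)
```

Note that the slicing makes a view, not a copy, so this costs no extra memory.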

OK, this can be seen as interpolation on a grid with a nearest-neighbor interpolator. What I am unsure about is whether you want to interpolate over your NaNs, or whether they just mean missing data.

I would do this by representing the matrix as a sparse matrix in COO format, which gives you lists of row and column positions for your data points. Then I would use a nearest-neighbor structure (such as scipy's KDTree, or scikit-learn's BallTree for even better performance, http://scikit-learn.sourceforge.net/modules/neighbors.html) to find, for each grid point, which data point is closest, and fill in your grid.

I suspect that your problem is that you can't fit the whole matrix in memory. If your data points are reasonably homogeneously distributed in the matrix, I would simply process the problem in sub-matrices, making sure that I train the nearest neighbor on a sub-matrix that is larger than the sampling grid by a margin of more than the inter-point distance.

HTH,

Gael
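A minimal 2-D sketch of that fill strategy, assuming scipy.spatial.cKDTree (the in-memory case; the array size and NaN fraction below are made up, and np.argwhere stands in for the row/col lists a COO representation would give you):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy grid with ~70% of the entries knocked out as NaN.
rng = np.random.default_rng(0)
grid = rng.random((50, 50))
grid[rng.random((50, 50)) < 0.7] = np.nan

# (row, col) positions of the known samples -- the same lists a
# COO sparse matrix would hold -- and their values, in matching order.
known = ~np.isnan(grid)
known_coords = np.argwhere(known)
known_values = grid[known]

# For every grid point, find the closest known sample and take its value.
tree = cKDTree(known_coords)
all_coords = np.argwhere(np.ones_like(grid, dtype=bool))
_, nearest = tree.query(all_coords)
filled = known_values[nearest].reshape(grid.shape)
```

For the out-of-memory case, the same query would run per sub-matrix, with the tree built on a window padded by more than the inter-point distance, as described above.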