I have a dataset of N points (in 4 dimensions) and I'd like to select a smaller subset, size M, of those points that are maximally spread out. An approximation is fine. Other than the K-d tree, is there anything in SciPy or other Python module to help accomplish this? Thanks
On Wed, Jul 27, 2011 at 05:45:20PM -0400, Gustavo Goretkin wrote:
I have a dataset of N points (in 4 dimensions) and I'd like to select a smaller subset, size M, of those points that are maximally spread out.
The problem that you are trying to solve is close to the k-medoids problem. I don't know of Python modules implementing a k-medoids. Alternatively, the k_init function used to initialize the k-means in the scikits.learn [1] might be a useful approximation. It's a pretty brutal approximation, and it might not work for you, but it should be fast. Gaël [1] https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/clust...
Gustavo, I'd go with KDTree, fast and easy in 4d; if you build a tree with big leaves, leafsize ~ N/M, then take the midpoints of each leaf, that should do ? To walk the leaves, add the below to .../scipy/spatial/kdtree.py (or .pyx, but building the tree is not much slower in pure python). cheers -- denis def forleaves( self, func, *args, **kwargs ): """ call func( data ) for each leaf, e.g. leafmid = [] def leaffunc( data, leafmid=leafmid ): leafmid.append( data.mean(axis=0 )) """ q = [self.tree] while q: node = heappop(q) if isinstance( node, KDTree.leafnode ): data = self.data[node.idx] func( data, *args, **kwargs ) # test-leaves.py else: heappush( q, node.less ) heappush( q, node.greater ) On Jul 27, 11:45 pm, Gustavo Goretkin <gustavo.goret...@gmail.com> wrote:
I have a dataset of N points (in 4 dimensions) and I'd like to select a smaller subset, size M, of those points that are maximally spread out. An approximation is fine. Other than the K-d tree, is there anything in SciPy
participants (3)
-
denis
-
Gael Varoquaux
-
Gustavo Goretkin