Mailman 3 Looking for a way to cluster data - SciPy-User

Looking for a way to cluster data

older
Is numpy's argsort lying about its...

Gary Ruben

25 Apr 2009 25 Apr '09

10:18 p.m.

Hi all, I'm looking for some advice on how to order data points so that I can visualise them. I've been looking at scipy.cluster for this purpose but I'm not sure whether it is suitable so I thought I'd see whether anyone had suggestions for a simpler suggestion of how to order the coordinates. I have a binary 3D array containing 1's that form a shape in a 3D volume against a background of 0's - they form a skeleton of a connected, branched structure. Furthermore, the points are all 26-connected to each other, i.e. there are no gaps in the skeleton. The longest chains may be 1000's of points long. It would be nice to visualise these using the mayavi mlab plot3d function, which draws tubes and which requires ordered coordinates as input, so I need to get ordered coordinate lists that traverse the points along the branches of the skeleton. It would also be nice to preferentially cluster long chains since then I can cull very short chains from the visualisation. scipy.cluster seems to be able to cluster the points but I'm not sure how to get the x,y,z coordinates of the original points out of its linkage data. This may not be possible. Maybe the scipy.spatial module is a better match to my problem. Any suggestions? Gary

Show replies by date

Zachary Pincus

26 Apr 26 Apr

6:51 p.m.

New subject: [SciPy-user] Looking for a way to cluster data

Hi Gary,

...

I'm looking for some advice on how to order data points so that I can visualise them. I've been looking at scipy.cluster for this purpose but I'm not sure whether it is suitable so I thought I'd see whether anyone had suggestions for a simpler suggestion of how to order the coordinates.

I have a binary 3D array containing 1's that form a shape in a 3D volume against a background of 0's - they form a skeleton of a connected, branched structure. Furthermore, the points are all 26-connected to each other, i.e. there are no gaps in the skeleton. The longest chains may be 1000's of points long. It would be nice to visualise these using the mayavi mlab plot3d function, which draws tubes and which requires ordered coordinates as input, so I need to get ordered coordinate lists that traverse the points along the branches of the skeleton. It would also be nice to preferentially cluster long chains since then I can cull very short chains from the visualisation.

It sounds like what you want from the clustering is to get groups of pixels that form straight-ish lines (and could thus be visualized with 3D rods). Is this correct? Most clustering algorithms are likely to just give you groups of pixels that are nearby spatially -- which is probably not exactly what you want, since if you (say) have a long rod- structure, you'd want all the pixels in that rod grouped together, and not grouped with separate rods that cross nearby in space but aren't physically connected. So if you do want to cluster the pixels, you'll need to use the geodesic distance between two pixels, not the euclidian distance. But that still wouldn't give you sets of rods... more like rods and blobs at the junctions. Another option would be to try to detect junction points on the skeleton (I think this is a common operation -- sometimes in 2D people brute-force it by looking for all possible patterns of pixels indicative of branching, but there's probably a nicer way, especially for 3d). Once the junctions are known, a simple flood-fill along the skeleton gives you sets of pixels that lie between each junction. Having such a connectivity graph is a useful thing -- lots of neat stuff one can calculate from that. In fact, I think you'd need to do this to calculate geodesic distances anyway... Anyhow, you could then try fitting the points from each branch to a poly-line (using some information criterion for determining how many lines to use), to simplify the representation down to rods for plotting. But perhaps it's worth figuring out if there's another visualization method you could use, before going to all this effort... Would volume rendering work? Perhaps you could go as far as junction-point-finding and branch-labeling. Just doing that would let you exclude short branches (or short terminal branches) and then you could simply volume- render the rest and not fiddle with line-fitting. Zach On Apr 25, 2009, at 11:18 PM, Gary Ruben wrote:

...

Hi all,

I'm looking for some advice on how to order data points so that I can visualise them. I've been looking at scipy.cluster for this purpose but I'm not sure whether it is suitable so I thought I'd see whether anyone had suggestions for a simpler suggestion of how to order the coordinates.

I have a binary 3D array containing 1's that form a shape in a 3D volume against a background of 0's - they form a skeleton of a connected, branched structure. Furthermore, the points are all 26-connected to each other, i.e. there are no gaps in the skeleton. The longest chains may be 1000's of points long. It would be nice to visualise these using the mayavi mlab plot3d function, which draws tubes and which requires ordered coordinates as input, so I need to get ordered coordinate lists that traverse the points along the branches of the skeleton. It would also be nice to preferentially cluster long chains since then I can cull very short chains from the visualisation.

scipy.cluster seems to be able to cluster the points but I'm not sure how to get the x,y,z coordinates of the original points out of its linkage data. This may not be possible. Maybe the scipy.spatial module is a better match to my problem.

Any suggestions?

Gary _______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user

Gary Ruben

27 Apr 27 Apr

4:56 a.m.

New subject: [SciPy-user] Looking for a way to cluster data

Hi Zach, Many thanks for your thoughts. I've had a good go at this myself now and am getting some useful results, although the processing is a bit slow at the moment. The approach I took is as follows: Using scipy.ndimage I label the 26-connected regions so that I can process each connected branch-like structure in turn. For each of these, I form a graph with nodes corresponding to pixels and edges corresponding to the Minimum Spanning Tree weighted by euclidean distance. I then use NetworkX to traverse this graph, extracting the node visitation order as a single list. Whenever this list makes jumps of distance >sqrt(2), it is jumping to a different branch, so I identify these jumps and subdivide the list into a list of lists, each containing nodes containing only one branch. I then plot these using plot3d. Problems with this approach are that generating the Minimum Spanning Tree in pure Python is slow - It turns out for the example I'm trying that I have one large region containing 25000 pixels. For the moment I've ignored this region and am only dealing with the smaller region which are only up to a few hundred pixels. I may need to look at Cython for this if I pursue this approach. I've made a few in-line responses to your comments. Zachary Pincus wrote:

...

Hi Gary, <snip> It sounds like what you want from the clustering is to get groups of pixels that form straight-ish lines (and could thus be visualized with 3D rods). Is this correct?

Actually, I probably want curved lines so your later comments and suggestions about fitting poly-lines are what I may try next.

...

Most clustering algorithms are likely to just give you groups of pixels that are nearby spatially -- which is probably not exactly what you want, since if you (say) have a long rod- structure, you'd want all the pixels in that rod grouped together, and not grouped with separate rods that cross nearby in space but aren't physically connected. So if you do want to cluster the pixels, you'll need to use the geodesic distance between two pixels, not the euclidian distance. But that still wouldn't give you sets of rods... more like rods and blobs at the junctions.

I'm not sure what you mean by geodesic distance here. It's true that I have unwanted connections due to the proximity of regions to one another, but without some very fancy code to detect these cases, I think I'll be satisfied with my rough approximate method. I know one way of doing this is to look at the behaviour of tangent vectors to the curves but this is impractical in pure Python and probably not worth the effort for my purposes.

...

Another option would be to try to detect junction points on the skeleton (I think this is a common operation -- sometimes in 2D people brute-force it by looking for all possible patterns of pixels indicative of branching, but there's probably a nicer way, especially for 3d). Once the junctions are known, a simple flood-fill along the skeleton gives you sets of pixels that lie between each junction.

Having such a connectivity graph is a useful thing -- lots of neat stuff one can calculate from that. In fact, I think you'd need to do this to calculate geodesic distances anyway... Anyhow, you could then try fitting the points from each branch to a poly-line (using some information criterion for determining how many lines to use), to simplify the representation down to rods for plotting.

I'm planning to do this next. I'm hoping I can think of a very simple criterion.

...

But perhaps it's worth figuring out if there's another visualization method you could use, before going to all this effort... Would volume rendering work?

It may. I haven't tried volume rendering yet, but rendering tubes is the ideal. I have tried just isosurfacing the detected pixels and this isn't so bad, but tubes give the eye extra depth cues that I'm hoping to exploit to make things look more informative.

...

Perhaps you could go as far as junction-point-finding and branch-labeling. Just doing that would let you exclude short branches (or short terminal branches) and then you could simply volume- render the rest and not fiddle with line-fitting.

I'll think about this too, but I may try straight volume rendering first and see how it looks.

...

Zach

Thanks for taking the time to respond. regards, Gary

Zachary Pincus

11:32 a.m.

New subject: [SciPy-user] Looking for a way to cluster data

Hi Gary,

...

Using scipy.ndimage I label the 26-connected regions so that I can process each connected branch-like structure in turn. For each of these, I form a graph with nodes corresponding to pixels and edges corresponding to the Minimum Spanning Tree weighted by euclidean distance. I then use NetworkX to traverse this graph, extracting the node visitation order as a single list. Whenever this list makes jumps of distance >sqrt(2), it is jumping to a different branch, so I identify these jumps and subdivide the list into a list of lists, each containing nodes containing only one branch. I then plot these using plot3d.

This sounds like a pretty good approach. Unfortunately, as you've recognized, any operations that loop over pixels in pure python are pretty much too slow to use for large images. I definitely think cython might be a good thing to look into. Another possibility, if you were coding this in c or cython, would be to just do a flood-fill directly. (You could do the recursive algorithm, or just use a stack. Either way it's simple.) Every time you encounter a branch-point, increment a counter and use that new value for the "flood-fill" color. Then you have a labeled image where each branch has a given value. Conveniently, ndimage has lots of useful tools for dealing with such label images. You could also build up the connectivity graph during the flood-fill (nodes = branch points, edge weights = the number of pixels along each branch), for calculating the MST.

...

...
It sounds like what you want from the clustering is to get groups of pixels that form straight-ish lines (and could thus be visualized with 3D rods). Is this correct?

Actually, I probably want curved lines so your later comments and suggestions about fitting poly-lines are what I may try next.

I have another idea that might be good: take the x,y,z coords of each pixel along each branch, and fit a parametric smoothing spline through them with scipy.interpolate.splprep (use a non-zero / non-None smoothing parameter 's'). Then to render the curves, just evaluate the spline (scipy.interpolate.splev) at however many internal points you want, to give a poly-line. Use more points to get better resolution... I do this a lot when I need to "post-filter" binarized image data into smooth continuous arcs.

...

...
Most clustering algorithms are likely to just give you groups of pixels that are nearby spatially -- which is probably not exactly what you want, since if you (say) have a long rod- structure, you'd want all the pixels in that rod grouped together, and not grouped with separate rods that cross nearby in space but aren't physically connected. So if you do want to cluster the pixels, you'll need to use the geodesic distance between two pixels, not the euclidian distance. But that still wouldn't give you sets of rods... more like rods and blobs at the junctions.

I'm not sure what you mean by geodesic distance here.

The geodesic distance between two pixels would be the shortest path *along the binarzied skeleton* between those pixels. As opposed to the shortest straight-line between the points. This would be relevant, say, if you have two branches that are nearby in space, but rather far from one another in terms of connectivity -- then using a euclidian distance to cluster pixels might put neaby pixels from the two different branches in the same cluster.

...

...
Having such a connectivity graph is a useful thing -- lots of neat stuff one can calculate from that. In fact, I think you'd need to do this to calculate geodesic distances anyway... Anyhow, you could then try fitting the points from each branch to a poly-line (using some information criterion for determining how many lines to use), to simplify the representation down to rods for plotting.

I'm planning to do this next. I'm hoping I can think of a very simple criterion.

Check out the standard, if maybe hokey, information criteria for model selection, like AIC and BIC. They offer semi-principled ways to trade off goodness-of-fit vs. model parsimony. http://en.wikipedia.org/wiki/Akaike_information_criterion and related pages. Zach

Anne Archibald

1:51 p.m.

New subject: [SciPy-user] Looking for a way to cluster data

2009/4/25 Gary Ruben <gruben@bigpond.net.au>:

...

Hi all,

I'm looking for some advice on how to order data points so that I can visualise them. I've been looking at scipy.cluster for this purpose but I'm not sure whether it is suitable so I thought I'd see whether anyone had suggestions for a simpler suggestion of how to order the coordinates.

I have a binary 3D array containing 1's that form a shape in a 3D volume against a background of 0's - they form a skeleton of a connected, branched structure. Furthermore, the points are all 26-connected to each other, i.e. there are no gaps in the skeleton. The longest chains may be 1000's of points long. It would be nice to visualise these using the mayavi mlab plot3d function, which draws tubes and which requires ordered coordinates as input, so I need to get ordered coordinate lists that traverse the points along the branches of the skeleton. It would also be nice to preferentially cluster long chains since then I can cull very short chains from the visualisation.

scipy.cluster seems to be able to cluster the points but I'm not sure how to get the x,y,z coordinates of the original points out of its linkage data. This may not be possible. Maybe the scipy.spatial module is a better match to my problem.

Any suggestions?

If I understand you correctly, what you have is not a list of coordinates of freely-located points, it's a binary mask indicating which voxels are part of your object. So first of all, have you considered using volumetric visualization tools? These seem like they might be a better fit to your problem. If what you want to know about is the connectivity of your object, though, I can see why you might want to build chains of rods. The most direct approach is, for each cell that is a 1, to draw rods from it to each of its neighbors that is on. This may not give you what you want: if you have regions where all the cells are on, they'll be a dense grid of rods. It will also not allow you to provide long strings of rods to your 3D toolkit, or to eliminate short chains. As I see it, then, your problem is graph-theoretic: you have this fairly dense adjacency graph of "on" cells, and you want to pare it down. One good choice would be to produce a (minimum diameter?) spanning tree, which should be pretty easy to convert to a collection of strings of rods. But I think what you want is a graph library of some sort. On the other hand, if what you have is "fat" chains of on cells, and you want to build up a "skeleton" of them (like converting the pixels of an image of the letter o back to a circle), you might look at machine vision for help, they do this sort of thing often. Anne

Gary Ruben

28 Apr 28 Apr

7:36 a.m.

New subject: [SciPy-user] Looking for a way to cluster data

Hi Anne, <snip>

...

If I understand you correctly, what you have is not a list of coordinates of freely-located points, it's a binary mask indicating which voxels are part of your object. So first of all, have you considered using volumetric visualization tools? These seem like they might be a better fit to your problem.

I haven't tried this yet, but I will - Zach suggested it too elsewhere in this thread.

...

If what you want to know about is the connectivity of your object, though, I can see why you might want to build chains of rods. The most direct approach is, for each cell that is a 1, to draw rods from it to each of its neighbors that is on. This may not give you what you want: if you have regions where all the cells are on, they'll be a dense grid of rods. It will also not allow you to provide long strings of rods to your 3D toolkit, or to eliminate short chains.

As I see it, then, your problem is graph-theoretic: you have this fairly dense adjacency graph of "on" cells, and you want to pare it down. One good choice would be to produce a (minimum diameter?) spanning tree, which should be pretty easy to convert to a collection of strings of rods. But I think what you want is a graph library of some sort.

I may have (pre-emptively) read your mind because that's basically the approach I took, but I'm not really interested directly in connectivity information, just in ordering the voxel coordinates so I can draw the lines as tubes. I used NetworkX but I've built the Minimum Spanning Tree myself because NetworkX needs the edge information to create its graphs.

...

On the other hand, if what you have is "fat" chains of on cells, and you want to build up a "skeleton" of them (like converting the pixels of an image of the letter o back to a circle), you might look at machine vision for help, they do this sort of thing often.

Anne

Thanks - that's not my problem - my chains are all lines of single-voxel thickness to start. Thanks for the suggestions. I've got a workable approach now that needs some tweaking but basically works. Volumetric visualisation may do the trick too and will be simple for me to try thanks to the wonderful mayavi. regards, Gary

Damian Eads

12 May 12 May

8:38 p.m.

New subject: [SciPy-user] Looking for a way to cluster data

Hi Gary, On Sat, Apr 25, 2009 at 8:18 PM, Gary Ruben <gruben@bigpond.net.au> wrote:

...

Hi all,

I'm looking for some advice on how to order data points so that I can visualise them. I've been looking at scipy.cluster for this purpose but I'm not sure whether it is suitable so I thought I'd see whether anyone had suggestions for a simpler suggestion of how to order the coordinates.

With the dendrogram function, the order nodes appear from left-to-right can be change with the distance_sort or count_sort functions.

...

I have a binary 3D array containing 1's that form a shape in a 3D volume against a background of 0's - they form a skeleton of a connected, branched structure. Furthermore, the points are all 26-connected to each other, i.e. there are no gaps in the skeleton. The longest chains may be 1000's of points long. It would be nice to visualise these using the mayavi mlab plot3d function, which draws tubes and which requires ordered coordinates as input, so I need to get ordered coordinate lists that traverse the points along the branches of the skeleton. It would also be nice to preferentially cluster long chains since then I can cull very short chains from the visualisation.

scipy.cluster seems to be able to cluster the points but I'm not sure how to get the x,y,z coordinates of the original points out of its linkage data. This may not be possible.

The rows of the linkage matrix are the clusters and the first two columns of the linkage matrix are the indices of the left and right node, respectively. If the index is less than the number of points clustered (i < N), it's a leaf node (original point/singleton cluster), otherwise it's a non-singleton cluster (i >= N). Note, that there are always (N-1) non-singleton clusters, so the linkage matrix will always have N-1 rows.

...

Maybe the scipy.spatial module is a better match to my problem.

I haven't had the chance to read this part of the discussion but I hope my answer to your question helps. Cheers, Damian ----------------------------------------------------- Damian Eads Ph.D. Candidate Jack Baskin School of Engineering, UCSC E2-489 1156 High Street Machine Learning Lab Santa Cruz, CA 95064 http://www.soe.ucsc.edu/~eads

Gary Ruben

15 May 15 May

9:10 p.m.

New subject: [SciPy-user] Looking for a way to cluster data

Hi Damian, Thanks for taking the time to reply. I ended up with a solution for now that doesn't use scipy.cluster and I won't have the time to revisit this, but I think that with the information you provided, I could probably have used the dendrogram function and not taken a graph-theory approach. Gary Damian Eads wrote:

...

Hi Gary,

On Sat, Apr 25, 2009 at 8:18 PM, Gary Ruben <gruben@bigpond.net.au> wrote:

...
Hi all,

I'm looking for some advice on how to order data points so that I can visualise them. I've been looking at scipy.cluster for this purpose but I'm not sure whether it is suitable so I thought I'd see whether anyone had suggestions for a simpler suggestion of how to order the coordinates.

With the dendrogram function, the order nodes appear from left-to-right can be change with the distance_sort or count_sort functions.

...
I have a binary 3D array containing 1's that form a shape in a 3D volume against a background of 0's - they form a skeleton of a connected, branched structure. Furthermore, the points are all 26-connected to each other, i.e. there are no gaps in the skeleton. The longest chains may be 1000's of points long. It would be nice to visualise these using the mayavi mlab plot3d function, which draws tubes and which requires ordered coordinates as input, so I need to get ordered coordinate lists that traverse the points along the branches of the skeleton. It would also be nice to preferentially cluster long chains since then I can cull very short chains from the visualisation.

scipy.cluster seems to be able to cluster the points but I'm not sure how to get the x,y,z coordinates of the original points out of its linkage data. This may not be possible.

The rows of the linkage matrix are the clusters and the first two columns of the linkage matrix are the indices of the left and right node, respectively. If the index is less than the number of points clustered (i < N), it's a leaf node (original point/singleton cluster), otherwise it's a non-singleton cluster (i >= N). Note, that there are always (N-1) non-singleton clusters, so the linkage matrix will always have N-1 rows.

...
Maybe the scipy.spatial module is a better match to my problem.

I haven't had the chance to read this part of the discussion but I hope my answer to your question helps.

Cheers,

Damian

5527

Age (days ago)

5547

Last active (days ago)

List overview

Download

7 comments

4 participants

participants (4)

Anne Archibald
Damian Eads
Gary Ruben
Zachary Pincus

Looking for a way to cluster data

Gary Ruben

Zachary Pincus

Gary Ruben

Zachary Pincus

Anne Archibald

Gary Ruben

Damian Eads

Gary Ruben

tags

participants (4)