[Numpy-discussion] Overlapping ranges

Mon Mar 16 17:22:22 EDT 2009

I'm trying to file a set of data points, defined by genome coordinates, into bins, also based on genome coordinates. Each data point is (chromosome, start, end, point) and each bin is (chromosome, start, end). I have about 140 million points to file into around 100,000 bins. Both are (roughly) evenly distributed over the 24 chromosomes (1-22, X and Y). Genome coordinates are integers and my data points are floats. For each data point, (end - start) is roughly 1000, but the bins are of uneven widths. Bins might also overlap; in that case, I need to know all the bins that a point overlaps.

By overlap, I mean the start or end of the data point (or both) is inside the bin or that the point entirely covers the bin.
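
To make that definition concrete, here is a minimal numpy sketch of the overlap test for a single chromosome. The function name, array names and toy coordinates are made up for illustration, and it assumes bin boundaries count as "inside" (closed intervals). Under that assumption the three cases above collapse into one condition: the point starts no later than the bin ends, and ends no earlier than the bin starts.

```python
import numpy as np

def overlap_matrix(p_start, p_end, b_start, b_end):
    """Return a boolean array of shape (n_points, n_bins).

    Entry [i, j] is True when point i overlaps bin j, treating both
    as closed intervals (an assumption; the boundaries are inclusive).
    """
    # Points become a column, bins a row, so comparisons broadcast
    # to every (point, bin) pair.
    p_start = np.asarray(p_start)[:, None]
    p_end = np.asarray(p_end)[:, None]
    b_start = np.asarray(b_start)[None, :]
    b_end = np.asarray(b_end)[None, :]
    return (p_start <= b_end) & (p_end >= b_start)

# Toy data for one chromosome (illustrative values only).
p_start = np.array([100, 1500, 5000])
p_end = np.array([1100, 2500, 6000])
b_start = np.array([0, 1000, 1600, 7000])
b_end = np.array([500, 2000, 1700, 8000])

hits = overlap_matrix(p_start, p_end, b_start, b_end)

# Paired indices: point_idx[k] overlaps bin_idx[k].
point_idx, bin_idx = np.nonzero(hits)
```

The full matrix has n_points x n_bins entries, so with data of the size described above the points would have to be passed through in chunks rather than all at once.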

At the moment, I'm using a fairly naive approach that finds roughly where in the genome (which gene) each point might be and then checks it against the bins in that gene. If I split the problem up by chromosome, I feel sure there must be some super-fast matrix approach I can apply using numpy, but I'm struggling a bit. Can anybody suggest something?

Peter
