efficient interval containment lookup

brent bpederse at gmail.com
Tue Jan 13 00:24:45 EST 2009


On Jan 12, 8:55 pm, Per Freem <perfr... at yahoo.com> wrote:
> On Jan 12, 10:58 pm, Steven D'Aprano
>
>
>
> <ste... at REMOVE.THIS.cybersource.com.au> wrote:
> > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote:
> > > thanks for your replies -- a few clarifications and questions. the
> > > is_within operation is containment, i.e. (a,b) is within (c,d) iff a
> > >>= c and b <= d. Note that I am not looking for intervals that
> > > overlap... this is why interval trees seem to me to not be relevant, as
> > > the overlapping interval problem is way harder than what I am trying to
> > > do. Please correct me if I'm wrong on this...
>
> > To test for contained intervals:
> > a >= c and b <= d
>
> > To test for overlapping intervals:
>
> > not (b < c or a > d)
>
> > Not exactly what I would call "way harder".
>
> > --
> > Steven
>
> hi Steven,
>
> i found an implementation (which is exactly how i'd write it based on
> the description) here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html
>
> when i use this however, it comes out either significantly slower or
> equal to a naive search. my naive search just iterates through a
> smallish list of intervals and for each one says whether they overlap
> with each of a large set of intervals.
>
> here is the exact code i used to make the comparison, plus the code at
> the link i have above:
>
> class Interval():
>     def __init__(self, start, stop):
>         self.start = start
>         self.stop = stop
>
> import random
> import time
> num_ints = 30000
> init_intervals = []
> for n in range(0,
> num_ints):
>     start = int(round(random.random()
> *1000))
>     end = start + int(round(random.random()*500+1))
>     init_intervals.append(Interval(start, end))
> num_ranges = 900
> ranges = []
> for n in range(0, num_ranges):
>   start = int(round(random.random()
> *1000))
>   end = start + int(round(random.random()*500+1))
>   ranges.append((start, end))
> #print init_intervals
> tree = IntervalTree(init_intervals)
> t1 = time.time()
> for r in ranges:
>   tree.find(r[0], r[1])
> t2 = time.time()
> print "interval tree: %.3f" %((t2-t1)*1000.0)
> t1 = time.time()
> for r in ranges:
>   naive_find(init_intervals, r[0], r[1])
> t2 = time.time()
> print "brute force: %.3f" %((t2-t1)*1000.0)
>
> on one run, i get:
> interval tree: 8584.682
> brute force: 8201.644
>
> is there anything wrong with this implementation? it seems very right
> to me but i am no expert. any help on this would be relly helpful.

hi, the tree is inefficient when the interval is large. as the size of
the interval shrinks to much less than the expanse of the tree, the
tree will be faster. changing 500 to 50 in both cases in your script,
i get:
interval tree: 3233.404
brute force: 9807.787

so the tree will work for limited cases. but it's quite simple. check
the tree in bx-python:
http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/operations/quicksect.py
for a more robust implementation.
-brentp



More information about the Python-list mailing list