efficient interval containment lookup

Per Freem perfreem at yahoo.com
Tue Jan 13 00:34:42 EST 2009


hi brent, thanks very much for your informative reply -- didn't
realize this about the size of the interval.

thanks for the bx-python link.  could you (or someone else) explain
why the size of the interval makes such a big difference? i don't
understand why it affects efficiency so much...

thanks.

On Jan 13, 12:24 am, brent <bpede... at gmail.com> wrote:
> On Jan 12, 8:55 pm, Per Freem <perfr... at yahoo.com> wrote:
>
> > On Jan 12, 10:58 pm, Steven D'Aprano <ste... at REMOVE.THIS.cybersource.com.au> wrote:
> > > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote:
> > > > thanks for your replies -- a few clarifications and questions. the
> > > > is_within operation is containment, i.e. (a,b) is within (c,d) iff
> > > > a >= c and b <= d. Note that I am not looking for intervals that
> > > > overlap... this is why interval trees seem to me to not be relevant, as
> > > > the overlapping interval problem is way harder than what I am trying to
> > > > do. Please correct me if I'm wrong on this...
>
> > > To test for contained intervals:
> > > a >= c and b <= d
>
> > > To test for overlapping intervals:
>
> > > not (b < c or a > d)
>
> > > Not exactly what I would call "way harder".
>
> > > --
> > > Steven
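
(as a quick illustration, those two tests translate directly into python
-- a minimal sketch, with illustrative function names:

def is_within(a, b, c, d):
    # (a, b) is contained within (c, d)
    return a >= c and b <= d

def overlaps(a, b, c, d):
    # (a, b) and (c, d) share at least one point
    return not (b < c or a > d)

containment is just the special case of overlap where the query interval
sits entirely inside the other, so the per-pair test really isn't harder.)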
>
> > hi Steven,
>
> > i found an implementation (which is exactly how i'd write it based on
> > the description) here: http://hackmap.blogspot.com/2008/11/python-interval-tree.html
>
> > when i use this, however, it comes out either significantly slower
> > than or equal to a naive search. my naive search just iterates through
> > a smallish list of intervals and for each one says whether they
> > overlap with each of a large set of intervals.
>
> > here is the exact code i used to make the comparison, plus the code at
> > the link i have above:
>
> > class Interval():
> >     def __init__(self, start, stop):
> >         self.start = start
> >         self.stop = stop
>
> > import random
> > import time
>
> > num_ints = 30000
> > init_intervals = []
> > for n in range(0, num_ints):
> >     start = int(round(random.random()*1000))
> >     end = start + int(round(random.random()*500+1))
> >     init_intervals.append(Interval(start, end))
>
> > num_ranges = 900
> > ranges = []
> > for n in range(0, num_ranges):
> >     start = int(round(random.random()*1000))
> >     end = start + int(round(random.random()*500+1))
> >     ranges.append((start, end))
>
> > #print init_intervals
> > tree = IntervalTree(init_intervals)
>
> > t1 = time.time()
> > for r in ranges:
> >     tree.find(r[0], r[1])
> > t2 = time.time()
> > print "interval tree: %.3f" % ((t2-t1)*1000.0)
>
> > t1 = time.time()
> > for r in ranges:
> >     naive_find(init_intervals, r[0], r[1])
> > t2 = time.time()
> > print "brute force: %.3f" % ((t2-t1)*1000.0)
>
> > on one run, i get:
> > interval tree: 8584.682
> > brute force: 8201.644
>
> > is there anything wrong with this implementation? it seems very right
> > to me but i am no expert. any help on this would be really helpful.
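
(the naive_find used in that script isn't shown; based on the description
earlier -- it iterates through the list and reports which stored intervals
overlap the query -- it is presumably a linear scan along these lines,
though the exact predicate is a guess:

def naive_find(intervals, start, stop):
    # brute force: check every stored interval against the query range
    found = []
    for iv in intervals:
        if iv.stop >= start and iv.start <= stop:
            found.append(iv)
    return found

so both methods run 900 queries against 30000 intervals; the only question
is whether the tree can skip most of the 30000 on each query.)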
>
> hi, the tree is inefficient when the interval is large. as the size of
> the interval shrinks to much less than the expanse of the tree, the
> tree will be faster. changing 500 to 50 in both cases in your script,
> i get:
> interval tree: 3233.404
> brute force: 9807.787
>
> so the tree will work for limited cases. but it's quite simple. check
> the tree in bx-python: http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/opera...
> for a more robust implementation.
> -brentp
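
(to make brent's point concrete: the tree at the hackmap link appears to be
a centered interval tree -- each node keeps the intervals that cross its
center point and only pushes the rest down to its children. with intervals
roughly 500 wide on a 0..1000 domain, a large fraction of them cross any
given center, so they pile up in a few huge nodes and find() ends up
scanning nearly as much as the brute-force loop. a rough, self-contained
way to see the effect -- this is not the tree code itself, just a count of
how many intervals generated like the ones in the benchmark would straddle
a single center at 500:

import random

def straddle_count(width, n=30000, domain=1000, center=500):
    # count intervals that cross the center point, i.e. the ones a
    # centered interval tree would keep together at one node
    count = 0
    for _ in range(n):
        start = int(round(random.random() * domain))
        end = start + int(round(random.random() * width + 1))
        if start <= center <= end:
            count += 1
    return count

print "width ~500:", straddle_count(500)  # roughly a quarter of them
print "width ~50:", straddle_count(50)    # only a few percent

with width 50 only a thin slice of the 30000 intervals sits at any one
node, which matches the speedup brent reports, and is presumably why he
points at bx-python for a more robust implementation.)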




