efficient interval containment lookup
Tim Chase
python.list at tim.thechases.com
Mon Jan 12 15:43:57 EST 2009
> suppose I have two lists of intervals, one significantly larger than
> the other.
> For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain
> thousands
> of elements while listB (of the same form) might contain hundreds of
> thousands
> or millions of elements.
> I want to count how many intervals in listB are contained within every
> listA. For example, if listA = [(10, 30), (600, 800)] and listB =
> [(20, 25), (12, 18)] is the input, then the output should be that (10,
> 30) has 2 intervals from listB contained within it, while (600, 800)
> has 0. (Elements of listB can be contained within many intervals in
> listA, not just one.)
>
> What is an efficient way to this? One simple way is:
>
> for a_range in listA:
> for b_range in listB:
> is_within(b_range, a_range):
> # accumulate a counter here
Could you detail the is_within() function? most importantly, is
it inclusive, such that "a1 <= b1 <= b2 <= a2", or is it
overlapping such that "a1 <= b2 or a2 <= b1". Additionally, do
you want to count how many intervals of A overlap with intervals
of B, or do you just want a count of how many intervals in B have
*any* overlap within A?
My leaning would be to make a custom "is this in A" function,
iterate over B and test against an "appropriate subset" of A:
listA = sorted([...])
min_a1 = min(a1 for (a1, a2) in listA)
max_a2 = max(a2 for (a1, a2) in listA)
def in_a(b1, b2):
for a1, a2 in smartly_chosen_subset(listA, b1, b2):
if is_within((b1, b2), (a1, a2)): return True
return False
i = 0
for b1, b2 in sorted(listB):
if b2 < min_a1: continue
if b1 > max_a2: break
if in_a(b1, b2): i += 1
The in_a() function can be optimized to terminate early if a1>b2
or a2 < b1 (perhaps even choosing an smart starting point for
iterating). Short-circuiting by returning True as soon as you
know there's a match will trim some of the time.
How distributed are the endpoints? If there's a lot of
commonality in the data, you might be able to cache results to
cut even further.
Just a few ideas, and questions that might help develop better
solutions.
-tkc
More information about the Python-list
mailing list