Best way to structure data for efficient searching

Peter Otten __peter__ at web.de
Mon Apr 2 17:32:45 EDT 2012


Larry.Martell at gmail.com wrote:

> I have the following use case:
> 
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
> 
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
> 
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
> 
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
> 
> Is there a better, more efficient way I could have done this?

Build a lookup table that maps (K3, K4) to the matching records in the 
second table. Then you can walk through the first table, look up the 
matching records in the second and filter by the timestamp constraint:

from collections import defaultdict, namedtuple

# simulate a database
One = namedtuple("One", "K1 K2 T")
Two = namedtuple("Two", "K3 K4 K5 T")
one = [
 ["alpha", "beta", 10],
 ["gamma", "delta", 20],
 ["gamma", "delta", 25],
 ["gamma", "delta", 40],
 ["kappa", "lambda", 40],
]
one = (One(*row) for row in one)

two = [
 ["alpha", "beta", "epsilon", 10],
 ["gamma", "delta", "zeta", 20],
 ["gamma", "delta", "eta", 60],
]
two = (Two(*row) for row in two)

# build the lookup table
lookup = defaultdict(list)
for rtwo in two:
    lookup[rtwo.K3, rtwo.K4].append(rtwo)

# show the matches
for rone in one:
    for rtwo in lookup.get((rone.K1, rone.K2), ()):
        if abs(rone.T-rtwo.T) <= 20:
            print rone, "-->", rtwo

(Personally I'd go for the SQL approach proposed by Jon; I find the argument 
why you can't do it unconvincing)





More information about the Python-list mailing list