Best way to structure data for efficient searching

Jon Clements joncle at googlemail.com
Wed Mar 28 21:52:12 CEST 2012


On Wednesday, 28 March 2012 19:39:54 UTC+1, Larry.... at gmail.com  wrote:
> I have the following use case:
> 
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
> 
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
> 
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
> 
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
> 
> Is there a better, more efficient way I could have done this?

It might not be more *efficient* but others might find it more readable, and it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and query as (untested)

select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - t2.timestamp) < 20

Failing that, two (default)dicts with a tuple as the pair, then use that as your base.

Jon.




More information about the Python-list mailing list