Best match searching

Brian Kelley bkelley at wi.mit.edu
Tue Dec 30 18:17:16 CET 2003

Daniel Pryde wrote:
> The array is implemented simply using a list of lists. The only solution
> I could think of is to use some for statements, but that seems to take a
> while, even for just one search.
Ahh this brings back memories of my image processing days :)

I've implemented a couple of approaches.
1) using just one list to hold the data
2) using a list of lists
3) using numeric

I've optimized the search function for 1 and 2 a little bit by defining
the scoring function on the fly like this and binding input during the
function definition:

input = 0.5
def euclidean(x, input=input): return abs(x-input)

We can also use a bounding box technique such that if a minimum x is
found and the x < input then we can reject all points less than x
without having to perform the euclidean distance.  Conversely if x >
input then we can reject all points greater than x.  This really only
works if x is a single value and not a vector.

Don't forget that when looking for minimums using a euclidean distance
you don't have to perform the square root.  (x1*x1+x2*x2) will find the
correct minimum as well as opposed to the more time consuming
math.sqrt(x1*x1+x2*x2)

Here are the results looking for data closest to an input value of 0.5
using a pretty loaded pentium 4 2ghz.  The result is for a single search
against a random array of 600x400.

time           row col value
0.130000114441 396 328 0.500001351171 single list
0.119999885559 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

So numeric is the clear winner here.  However, when psyco is used
http://psyco.sourceforge.net/

time            row col value
0.0400000810623 396 328 0.500001351171 single list
0.0499999523163 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

psyco actually wins here but more interesting is that the timing order
is reversed!  I'll have to do a lot more trials to generate good
profiling, but I think the results are pretty good here.

Here is the test code, feel free to rip it apart as you will.
import Numeric, random, time

# create data
data = [random.random() for x in xrange(600*400)]

# test list of list
lists = []
for i in range(400): # rows
lists.append(data[i*600:(i+1)*600]) # columns

matrix = Numeric.array(lists)

def testarray(data=data):
t1 = time.time()
# straight forward approach
mn = 10e99
index = None
lx = -10e99 # keep track of the best found
hx = 10e99  # approaching from the bottom and top

input = 0.5
def euclidean(x, input=input):
return abs(x-input)

for i,x in enumerate(data):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = i

t2 = time.time()
row = index/600
col = index%600
print t2-t1, row, col, data[index], "single list"

def testlists(list=list):
mn = 10e99
index = None
input = 0.5
lx = -10e99 # keep track of the best found
hx = 10e99  # approaching from the bottom and top

def euclidean(x, input=input):
return abs(x-input)
t1 = time.time()
for r, row in enumerate(lists):
for c, x in enumerate(row):
if x < lx or x > hx: # can we reject this point?
continue
y = euclidean(x)
if y < mn:
if x < input:
lx = x
else:
hx = x
mn = y
index = r,c

t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "list of lists"

def testmatrix(matrix=matrix):
t1 = time.time()
mins = abs(matrix-0.5)
mn = 10e99
index = None
for r,c in enumerate(Numeric.argmin(mins)):
y = mins[r,c]
if y < mn:
mn = y
index = r,c
t2 = time.time()
r,c = index
print t2-t1, r,c, lists[r][c], "matrix"

# list of lists approach
testarray()
testlists()
testmatrix()

print "*"*44
print "try with psyco"
try:
import psyco
psyco.full()
testarray()
testlists()
testmatrix()
except ImportError:
print "psyco not available"