Hi all! For reverse engineering purposes, I need to find every place where a chunk of bytes from file A occurs in file B. Obviously, if a chunk of length n is contained in B, I don't want my script to also report all the subchunks of size < n inside it. (For example, if "ABCDE" occurs in both files, I want just that 5-byte chunk reported, not "ABCD", "BCDE" and so on.)

I coded a naive implementation of a similarity algorithm: scan the similarity matrix and find all the diagonals. Here's the code:

```python
from sys import argv
import numpy

# read both files as raw byte arrays (binary mode, so nothing is translated)
file_a = open(argv[1], 'rb')
file_b = open(argv[2], 'rb')
a = numpy.fromstring(file_a.read(), 'c')
b = numpy.fromstring(file_b.read(), 'c')
tolerance = int(argv[3])  # minimum number of extensions past the first byte

chunks = []
valid = True
for x in xrange(len(a)):
    for y in xrange(len(b)):
        count = 0
        if a[x] == b[y]:
            # found a diagonal; walk it until the bytes stop matching
            x_cnt, y_cnt = x, y
            try:
                while a[x_cnt + 1] == b[y_cnt + 1]:
                    count += 1
                    x_cnt += 1
                    y_cnt += 1
            except IndexError:
                pass  # fell off the end of one of the files
            if count >= tolerance:
                # discard the chunk if it lies inside one we already stored
                for c in chunks:
                    if x >= c[0] and x_cnt <= c[1] and y >= c[2] and y_cnt <= c[3]:
                        valid = False
                        if __debug__:
                            print "Found an already processed subchunk"
                        break
                if valid:
                    chunks.append((x, x_cnt, y, y_cnt))
                    print "Corresponding chunk found. List 1: from " + str(x) + " to " + str(x_cnt) + ". List 2: from " + str(y) + " to " + str(y_cnt)
                    print "with length of " + str(x_cnt + 1 - x)
                else:
                    valid = True
```

It simply scans the similarity matrix, finds the diagonals on which a[x] == b[y], and interprets each diagonal as a chunk. It then stores the chunk in a list, after checking that it isn't a subchunk of a bigger chunk already found.

The problem: this implementation is very slow, IMHO because of three factors:

1. I use a nested for loop to scan the matrix.
2. When the start of a diagonal is found, the program walks the diagonal with yet another inner loop. Maybe it would be faster to use a function such as diagonal(), but I can't actually _create_ the boolean similarity matrix: for the files I care about (~100 KB each) it would have on the order of 10^10 cells, so I'm forced to compute it "on the way" (see the first sketch at the end of this post for what I have in mind).
3. When I find a chunk, I still generate all of its subchunks with this approach and compare each of them against the "big" chunks already stored in the list. Maybe there's a better algorithmic way (second sketch at the end of this post).

What do you think about these issues? Is there a way to optimize them? And are there other issues I didn't take into account?

Thanks in advance, regards,
Chris.
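P.S. To make point 2 concrete, here is a rough, untested sketch of what I mean by computing the matrix one diagonal at a time with numpy, without ever materializing the whole thing. `matches_per_diagonal` and `min_len` are names I made up for illustration (`min_len` would be `tolerance + 1` in the counting scheme of my code above), and I'm assuming `a` and `b` are numpy byte arrays loaded as in my script:

```python
import numpy

def matches_per_diagonal(a, b, min_len):
    """Yield (x, y, length) for every maximal run of equal bytes of at
    least min_len, visiting one diagonal of the implicit matrix at a time."""
    for d in xrange(-(len(b) - 1), len(a)):
        # compare the two overlapping slices elementwise: this boolean
        # vector is exactly diagonal d (the pairs with x - y == d)
        if d >= 0:
            eq = a[d:d + len(b)] == b[:len(a) - d]
        else:
            eq = a[:len(b) + d] == b[-d:len(a) - d]
        # locate maximal runs of True by diffing a zero-padded copy:
        # +1 marks a run start, -1 marks one past a run end
        padded = numpy.concatenate(([False], eq, [False])).astype(numpy.int8)
        edges = numpy.diff(padded)
        starts = numpy.where(edges == 1)[0]
        ends = numpy.where(edges == -1)[0]
        for s, e in zip(starts, ends):
            if e - s >= min_len:
                x = s + d if d >= 0 else s
                y = s if d >= 0 else s - d
                yield (x, y, e - s)
```

If this is sound, it should also take care of point 3 for free: only maximal runs along each diagonal are emitted, so the suffix subchunks my inner while loop keeps rediscovering never show up at all.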
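And for point 3 within my current nested-loop version: every subchunk the inner loop rediscovers is just a suffix of the same diagonal run, so it might be enough to walk a diagonal only from a genuine run start (i.e. where the previous byte pair differs, or we're at the edge of a file). Then subchunks are never generated in the first place and the whole list scan disappears. Untested sketch:

```python
# untested sketch: same nested loops as above, but any (x, y) lying in
# the interior of a run is skipped, so each maximal run is walked once
for x in xrange(len(a)):
    for y in xrange(len(b)):
        if a[x] != b[y]:
            continue
        if x > 0 and y > 0 and a[x - 1] == b[y - 1]:
            continue  # not the start of a run; it was walked already
        length = 1
        while (x + length < len(a) and y + length < len(b)
               and a[x + length] == b[y + length]):
            length += 1
        if length > tolerance:  # same cutoff as count >= tolerance above
            print ("Chunk: a[%d:%d] == b[%d:%d], length %d"
                   % (x, x + length, y, y + length, length))
```

This keeps the O(len(a) * len(b)) scan, but drops both the subchunk generation and the comparisons against the stored list.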