Hi all! For reverse engineering purposes, I need to find every place where a chunk of bytes from file A occurs in file B. Obviously, if a chunk of length n is contained in B, I don't want my script to also report all the subchunks of size < n inside it. (For example, if "ABCDE" occurs in both files, I want just that 5-byte chunk reported, not "ABCD", "BCDE" and so on.)

I coded a naive implementation of a similarity algorithm: scan the similarity matrix and find all the diagonals. Here's the code:

```python
from sys import argv
import numpy

# read both files as raw byte arrays (binary mode, so nothing is translated)
file_a = open(argv[1], 'rb')
file_b = open(argv[2], 'rb')
a = numpy.fromstring(file_a.read(), 'c')
b = numpy.fromstring(file_b.read(), 'c')
tolerance = int(argv[3])  # minimum number of extensions past the first byte

chunks = []
valid = True
for x in xrange(len(a)):
    for y in xrange(len(b)):
        count = 0
        if a[x] == b[y]:
            # found a diagonal; walk it until the bytes stop matching
            x_cnt, y_cnt = x, y
            try:
                while a[x_cnt + 1] == b[y_cnt + 1]:
                    count += 1
                    x_cnt += 1
                    y_cnt += 1
            except IndexError:
                pass  # fell off the end of one of the files
            if count >= tolerance:
                # discard the chunk if it lies inside one we already stored
                for c in chunks:
                    if x >= c[0] and x_cnt <= c[1] and y >= c[2] and y_cnt <= c[3]:
                        valid = False
                        if __debug__:
                            print "Found an already processed subchunk"
                        break
                if valid:
                    chunks.append((x, x_cnt, y, y_cnt))
                    print "Corresponding chunk found. List 1: from " + str(x) + " to " + str(x_cnt) + ". List 2: from " + str(y) + " to " + str(y_cnt)
                    print "with length of " + str(x_cnt + 1 - x)
                else:
                    valid = True
```

It simply scans the similarity matrix, finds the diagonals on which a[x] == b[y], and interprets each diagonal as a chunk. It then stores the chunk in a list, after checking that it isn't a subchunk of a bigger chunk already found.

The problem: this implementation is very slow, IMHO because of three factors:

1. I use a nested for loop to scan the matrix.
2. When the start of a diagonal is found, the program walks the diagonal with yet another inner loop. Maybe it would be faster to use a function such as diagonal(), but I can't actually _create_ the boolean similarity matrix: for the files I care about (~100 KB each) it would have on the order of 10^10 cells, so I'm forced to compute it "on the way" (see the first sketch at the end of this post for what I have in mind).
3. When I find a chunk, I still generate all of its subchunks with this approach and compare each of them against the "big" chunks already stored in the list. Maybe there's a better algorithmic way (second sketch at the end of this post).

What do you think about these issues? Is there a way to optimize them? And are there other issues I didn't take into account?

Thanks in advance, regards,
Chris.
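P.S. To make point 2 concrete, here is a rough, untested sketch of what I mean by computing the matrix one diagonal at a time with numpy, without ever materializing the whole thing. `matches_per_diagonal` and `min_len` are names I made up for illustration (`min_len` would be `tolerance + 1` in the counting scheme of my code above), and I'm assuming `a` and `b` are numpy byte arrays loaded as in my script:

```python
import numpy

def matches_per_diagonal(a, b, min_len):
    """Yield (x, y, length) for every maximal run of equal bytes of at
    least min_len, visiting one diagonal of the implicit matrix at a time."""
    for d in xrange(-(len(b) - 1), len(a)):
        # compare the two overlapping slices elementwise: this boolean
        # vector is exactly diagonal d (the pairs with x - y == d)
        if d >= 0:
            eq = a[d:d + len(b)] == b[:len(a) - d]
        else:
            eq = a[:len(b) + d] == b[-d:len(a) - d]
        # locate maximal runs of True by diffing a zero-padded copy:
        # +1 marks a run start, -1 marks one past a run end
        padded = numpy.concatenate(([False], eq, [False])).astype(numpy.int8)
        edges = numpy.diff(padded)
        starts = numpy.where(edges == 1)[0]
        ends = numpy.where(edges == -1)[0]
        for s, e in zip(starts, ends):
            if e - s >= min_len:
                x = s + d if d >= 0 else s
                y = s if d >= 0 else s - d
                yield (x, y, e - s)
```

If this is sound, it should also take care of point 3 for free: only maximal runs along each diagonal are emitted, so the suffix subchunks my inner while loop keeps rediscovering never show up at all.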
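And for point 3 within my current nested-loop version: every subchunk the inner loop rediscovers is just a suffix of the same diagonal run, so it might be enough to walk a diagonal only from a genuine run start (i.e. where the previous byte pair differs, or we're at the edge of a file). Then subchunks are never generated in the first place and the whole list scan disappears. Untested sketch:

```python
# untested sketch: same nested loops as above, but any (x, y) lying in
# the interior of a run is skipped, so each maximal run is walked once
for x in xrange(len(a)):
    for y in xrange(len(b)):
        if a[x] != b[y]:
            continue
        if x > 0 and y > 0 and a[x - 1] == b[y - 1]:
            continue  # not the start of a run; it was walked already
        length = 1
        while (x + length < len(a) and y + length < len(b)
               and a[x + length] == b[y + length]):
            length += 1
        if length > tolerance:  # same cutoff as count >= tolerance above
            print ("Chunk: a[%d:%d] == b[%d:%d], length %d"
                   % (x, x + length, y, y + length, length))
```

This keeps the O(len(a) * len(b)) scan, but drops both the subchunk generation and the comparisons against the stored list.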