[Tutor] How to identify clusters of similar files

Sat Jun 2 21:00:14 CEST 2012

Hi,

I want to use difflib to compare a lot (tens of thousands) of text files. I know that many files are quite similar as they are subsequent versions of the same document (a primitive kind of version control). What would be a good approach to cluster the files based on their likeness? I want to be able to say something like: the number of files could be reduced by a factor of ten when the number of (near-)duplicates is taken into account.

So let's say I have ten versions of a txt file: 'file0.txt', 'file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt', 'file6.txt', 'file7.txt', 'file8.txt', 'file9.txt'. How could I say, with some degree of certainty, that they are related? (I can't rely on the file names, I'm afraid.) file0 may be very similar to file1, but no longer to file9. Still, their likeness is "chained": each version resembles the next one. The situation is easier with perfectly identical files.
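
For the perfectly identical case, a minimal sketch might group files by a hash of their contents. The function name group_identical and the choice of SHA-256 are assumptions made for illustration only; any content hash would do.

```python
import hashlib
from collections import defaultdict

def group_identical(paths):
    """Group file paths whose contents are byte-for-byte identical.

    Hypothetical helper: returns a list of sorted path lists, one per
    group of duplicates (files without a duplicate are left out).
    """
    by_digest = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        by_digest[digest].append(p)
    # keep only digests shared by more than one file
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Exact duplicates found this way could be collapsed to one representative first, so that the more expensive difflib comparison only runs on the files that remain.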

The crude code below illustrates what I'd like to do, but it's too simplistic. I'd appreciate some thoughts or references to theoretical approaches to this kind of stuff.

import difflib, glob, os

path = "/home/aj/Destkop/someDir"
extension = ".txt"
cut_off = 0.95

allTheFiles = sorted(glob.glob(os.path.join(path, "*" + extension)))
clusters = {}  # maps each file name to the list of files similar to it

for f_a in allTheFiles:
    # read the outer file once, not once per inner iteration
    with open(f_a) as f:
        file_a = f.readlines()
    for f_b in allTheFiles:
        if f_a == f_b:
            continue
        with open(f_b) as f:
            file_b = f.readlines()
        # line-based comparison; a line consisting of a single space is junk
        likeness = difflib.SequenceMatcher(lambda x: x == " ", file_a, file_b).ratio()
        if likeness >= cut_off:
            clusters.setdefault(f_a, []).append(f_b)
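
One way the "chained" likeness might be captured, as a rough sketch rather than a definitive answer: treat every pair of files at or above the cut-off as linked, and let a cluster be everything reachable through such links (connected components, here via a small union-find). The function name chained_clusters and its dict-of-lines input are assumptions made for illustration.

```python
import difflib
from itertools import combinations

def chained_clusters(texts, cut_off=0.95):
    """Cluster texts whose likeness is chained.

    texts: dict mapping a name to a list of lines (hypothetical input shape).
    Returns a sorted list of clusters, each a sorted list of names.
    """
    parent = {name: name for name in texts}

    def find(x):
        # follow parent links to the cluster representative, halving the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # compare each unordered pair once and merge clusters of similar pairs
    for a, b in combinations(sorted(texts), 2):
        ratio = difflib.SequenceMatcher(None, texts[a], texts[b]).ratio()
        if ratio >= cut_off:
            parent[find(a)] = find(b)

    groups = {}
    for name in texts:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())
```

With this, file0 and file9 end up in the same cluster as long as every step in between passes the cut-off, even if the two are not similar to each other directly. The pairwise comparison is still quadratic in the number of files, so for tens of thousands of files some pre-filtering (for example with SequenceMatcher.quick_ratio(), which gives a cheap upper bound on ratio()) would probably be needed.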

Thank you in advance!

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a 
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 