[Tutor] How to identify clusters of similar files

Sun Jun 3 12:37:06 CEST 2012



From: Steven D'Aprano <steve at pearwood.info>
>To: Python Mailing List <tutor at python.org> 
>Sent: Sunday, June 3, 2012 4:00 AM
>Subject: Re: [Tutor] How to identify clusters of similar files
> 
>Albert-Jan Roskam wrote:
>> Hi,
>> 
>> I want to use difflib to compare a lot (tens of thousands) of text files. I
>> know that many files are quite similar as they are subsequent versions of
>> the same document (a primitive kind of version control). What would be a
>> good approach to cluster the files based on their likeness?
>
>You have already identified the basic tool: difflib. But your question is not really about Python; it is more about the algorithm used for clustering data according to goodness of fit. That's a hard problem, and you should consider asking it on the main Python mailing list or newsgroup too.
>
>Some search terms to get you started:
>
>biopython
>nltk  (the Natural Language Tool Kit)
>unrooted phylogram
>
>
>Good luck!
>
>
>-- Steven
>
>Hi Steven,
>
>Thanks! Biopython looks very interesting. While browsing I was thinking this problem could also be considered as a probabilistic/fuzzy linkage problem (Fellegi & Sunter). Instead of linking records, I am trying to 'link' files.
>
>
>Best wishes,
>Albert-Jan
>
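To make the 'linking' idea concrete, here is a minimal sketch, not a definitive approach. It scores every pair of texts with difflib.SequenceMatcher.ratio() and links any pair whose score reaches a cutoff, so that chains of linked files end up in one cluster (single linkage). The helper names (similarity, cluster_texts), the 0.6 threshold, and the sample file names and contents are all assumptions made up for illustration. It is a plain threshold rule, not the Fellegi & Sunter probabilistic model.

```python
import difflib
from itertools import combinations


def similarity(a, b):
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def cluster_texts(texts, threshold=0.6):
    """Group texts whose pairwise ratio reaches `threshold` (single linkage).

    `texts` maps a name to its content. The threshold is an arbitrary
    assumption and would need tuning on real data.
    Returns a sorted list of sorted name lists.
    """
    names = sorted(texts)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        # Walk up to the root, shortening the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Compare every pair; link the two groups when the pair is similar enough.
    for a, b in combinations(names, 2):
        if similarity(texts[a], texts[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical sample data standing in for file contents.
docs = {
    "report_v1.txt": "The quick brown fox jumps over the lazy dog.",
    "report_v2.txt": "The quick brown fox jumped over the lazy dogs.",
    "notes.txt": "Completely unrelated shopping list: eggs, milk, bread.",
}
clusters = cluster_texts(docs)
print(clusters)
```

Comparing every pair is quadratic in the number of files, so for tens of thousands of files some pre-filtering would probably be needed. SequenceMatcher also provides quick_ratio() and real_quick_ratio(), which return cheaper upper bounds on ratio() and can be used to skip pairs that cannot reach the threshold.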