[Tutor] How to identify clusters of similar files
fomcl at yahoo.com
Sun Jun 3 12:37:06 CEST 2012
>From: Steven D'Aprano <steve at pearwood.info>
>To: Python Mailing List <tutor at python.org>
>Sent: Sunday, June 3, 2012 4:00 AM
>Subject: Re: [Tutor] How to identify clusters of similar files
>Albert-Jan Roskam wrote:
>> I want to use difflib to compare a lot (tens of thousands) of text files. I
>> know that many files are quite similar as they are subsequent versions of
>> the same document (a primitive kind of version control). What would be a
>> good approach to cluster the files based on their likeness?
>You have already identified the basic tool: difflib. But your question is not really about Python, it is more about the algorithm used for clustering data according to goodness of fit. That's a hard problem, and you should consider asking it on the main Python mailing list or newsgroup too.
>Some search terms to get you started:
>nltk (the Natural Language Tool Kit)
Thanks! Biopython looks very interesting. While browsing I was thinking this problem could also be considered as a probabilistic/fuzzy linkage problem (Fellegi & Sunter). Instead of linking records, I am trying to 'link' files.
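
To make the 'linking' idea concrete, here is a minimal sketch, not a tested solution for tens of thousands of files. It scores every pair of texts with difflib.SequenceMatcher, links pairs whose ratio reaches a cutoff, and treats the connected groups of links as clusters. The function names (similarity, cluster_texts) and the 0.8 threshold are my own assumptions for illustration. This is a plain deterministic threshold, not the probabilistic Fellegi & Sunter model.

```python
import difflib
from itertools import combinations


def similarity(a, b):
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def cluster_texts(texts, threshold=0.8):
    """Group the keys of `texts` (name -> content) into clusters.

    Two names are linked when their contents score at or above
    `threshold` (an assumed cutoff, tune it for your data). Clusters
    are the connected components of those links.
    """
    # Union-find: every name starts out as its own cluster.
    parent = {name: name for name in texts}

    def find(name):
        # Walk up to the cluster representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Compare every pair once; this is quadratic in the number of texts.
    for a, b in combinations(texts, 2):
        if similarity(texts[a], texts[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in texts:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(names) for names in clusters.values())
```

Because the number of comparisons grows quadratically, this will be slow on a large collection. SequenceMatcher also offers real_quick_ratio() and quick_ratio(), which are cheap upper bounds on ratio(), so one option is to check those first and skip the full comparison for pairs that cannot reach the threshold.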