[BangPypers] How to compare the relevancy between news headlines?
devjyoti patra
djpatra at gmail.com
Tue Jun 14 12:15:27 CEST 2011
Hi Gopalakrishnan,
You can use the following algorithm for clustering the news items:
1. Tokenize the headline.
2. Remove the stop words from the headlines, i.e., words like "a",
"an", "is", "the", etc.
3. Generate shingles from the remaining words.
e.g., 4-shingles for "watches" (character n-grams of length 2 to 4,
starting at each position) generates the following:
['wa', 'wat', 'watc']
['at', 'atc', 'atch']
['tc', 'tch', 'tche']
['ch', 'che', 'ches']
['he', 'hes']
['es']
4. Calculate the Jaccard similarity between each pair of headlines. This
will result in an n*n matrix for n news headlines.
Jaccard similarity = (number of common shingles between HEAD_a and
HEAD_b) / (number of unique shingles in HEAD_a and HEAD_b combined)
5. Cluster the headlines, grouping together those whose similarity is at
least a configurable MIN_SIMILARITY_THRESHOLD.
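
The steps above could be sketched in Python roughly as follows. This is
only a minimal illustration under a few assumptions of mine: the stop-word
list is a tiny made-up sample, the function names are hypothetical, the
default threshold of 0.3 is arbitrary, and step 5 is done as simple
single-link grouping (any two headlines at or above the threshold end up
in the same cluster), which is just one of several ways to cluster.

```python
import re
from itertools import combinations

# Assumption: a tiny sample stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "is", "the", "of", "in", "to", "for", "on", "and"}


def tokenize(headline):
    """Step 1: lowercase the headline and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", headline.lower())


def shingles(word, max_len=4, min_len=2):
    """Step 3: character n-grams of length min_len..max_len at every position.

    Words shorter than min_len produce no shingles.
    """
    return {
        word[i:i + n]
        for i in range(len(word))
        for n in range(min_len, max_len + 1)
        if i + n <= len(word)
    }


def headline_shingles(headline):
    """Steps 1-3: tokenize, drop stop words, and pool the shingles of each word."""
    result = set()
    for word in tokenize(headline):
        if word not in STOP_WORDS:
            result |= shingles(word)
    return result


def jaccard(a, b):
    """Step 4: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(headlines):
    """Step 4: n*n matrix of pairwise Jaccard similarities."""
    sets = [headline_shingles(h) for h in headlines]
    return [[jaccard(x, y) for y in sets] for x in sets]


def cluster(headlines, min_similarity=0.3):
    """Step 5 (assumed single-link): merge headlines whose similarity
    is at least min_similarity, using a small union-find structure."""
    sim = similarity_matrix(headlines)
    parent = list(range(len(headlines)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(headlines)), 2):
        if sim[i][j] >= min_similarity:
            parent[find(i)] = find(j)

    groups = {}
    for i, headline in enumerate(headlines):
        groups.setdefault(find(i), []).append(headline)
    return list(groups.values())


if __name__ == "__main__":
    news = [
        "India wins the cricket match",
        "India won a cricket match",
        "Stock markets fall sharply",
    ]
    for group in cluster(news):
        print(group)
```

With these three sample headlines, the two cricket headlines land in one
cluster and the stock market headline stays on its own. The threshold
usually needs tuning on real data.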
Regards,
Devjyoti