[BangPypers] How to compare the relevancy between news headlines?

devjyoti patra djpatra at gmail.com
Tue Jun 14 12:15:27 CEST 2011


Hi Gopalakrishnan,

You can use the following algorithm to cluster the news items:

1. Tokenize the headline.
2. Remove the stop words from the headlines, i.e., words like "a",
"an", "is", "the", etc.
3. Generate shingles from the remaining words.
    e.g., 4-shingles for "watches" (here, the substrings of 2 to 4
    characters starting at each position) are the following:
    ['wa', 'wat', 'watc']
    ['at', 'atc', 'atch']
    ['tc', 'tch', 'tche']
    ['ch', 'che', 'ches']
    ['he', 'hes']
    ['es']

4. Calculate the Jaccard similarity between each pair of headlines. This
will result in an "n*n" matrix for "n" news headlines.
    Jaccard similarity = (number of shingles common to HEAD_a and
HEAD_b) / (number of unique shingles in HEAD_a and HEAD_b combined)

5. Cluster the headlines, grouping together those whose similarity is
at least a configurable MIN_SIMILARITY_THRESHOLD.
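
The steps above could be sketched in Python roughly as follows. This is
only a minimal illustration, not a tuned implementation. The stop-word
list, the 0.3 threshold value, and all function names are assumptions
made for the example. Step 5 does not name a clustering method, so the
sketch assumes the simplest one: any two headlines whose similarity
reaches the threshold end up in the same group (single-link grouping).

```python
import re
from itertools import combinations

# Assumption: a tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "is", "the", "of", "in", "to", "for", "and", "on"}

# Assumption: an arbitrary starting value; tune it on your own data.
MIN_SIMILARITY_THRESHOLD = 0.3


def tokenize(headline):
    """Step 1: lowercase the headline and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", headline.lower())


def shingles(word, min_len=2, max_len=4):
    """Step 3: substrings of min_len..max_len characters from each position."""
    result = set()
    for start in range(len(word)):
        for length in range(min_len, max_len + 1):
            if start + length <= len(word):
                result.add(word[start:start + length])
    return result


def headline_shingles(headline):
    """Steps 1-3: tokenize, drop stop words, collect shingles of the rest."""
    result = set()
    for token in tokenize(headline):
        if token not in STOP_WORDS:
            result |= shingles(token)
    return result


def jaccard(a, b):
    """Step 4: size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def similarity_matrix(headlines):
    """Step 4: n*n matrix of pairwise Jaccard similarities."""
    sets = [headline_shingles(h) for h in headlines]
    return [[jaccard(x, y) for y in sets] for x in sets]


def cluster(headlines, min_similarity=MIN_SIMILARITY_THRESHOLD):
    """Step 5 (assumed method): merge headlines linked by similar pairs."""
    matrix = similarity_matrix(headlines)
    parent = list(range(len(headlines)))

    def find(i):
        # Follow parent links to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(headlines)), 2):
        if matrix[i][j] >= min_similarity:
            parent[find(i)] = find(j)

    groups = {}
    for i, headline in enumerate(headlines):
        groups.setdefault(find(i), []).append(headline)
    return list(groups.values())


if __name__ == "__main__":
    news = [
        "India wins the cricket match",
        "India wins cricket match again",
        "Stock markets fall sharply",
    ]
    for group in cluster(news):
        print(group)
```

Single-link grouping can chain loosely related headlines into one large
cluster, so for bigger data sets a stricter method (or a higher
threshold) may work better.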

Regards,
Devjyoti

