How to pick out the same titles.
duncan smith
duncan at invalid.invalid
Sun Oct 16 14:19:57 EDT 2016
On 16/10/16 16:16, Seymore4Head wrote:
> How to pick out the same titles.
>
> I have a long text file that has movie titles in it and I would like
> to find dupes.
>
> The thing is that sometimes I have one called "The Killing Fields" and
> it also could be listed as "Killing Fields". Sometimes the title will
> have the date off by a year.
>
> What I would like to do is output to another file that shows those two
> as a match.
>
> I don't know the best way to tackle this. I would think you would
> have to pair the titles with the most consecutive letters in a row.
>
> Anyone want this as a practice exercise? I don't really use
> programming enough to remember how.
>
Tokenize the titles, generate (token) set similarity scores for pairs of
titles, and cluster on similarity score.
>>> import tokenization
>>> bigrams1 = tokenization.n_grams("The Killing Fields".lower(), 2, pad=True)
>>> bigrams1
['_t', 'th', 'he', 'e ', ' k', 'ki', 'il', 'll', 'li', 'in', 'ng', 'g ',
' f', 'fi', 'ie', 'el', 'ld', 'ds', 's_']
>>> bigrams2 = tokenization.n_grams("Killing Fields".lower(), 2, pad=True)
>>> import pseudo
>>> pseudo.Jaccard(bigrams1, bigrams2)
0.7
You could probably just generate token sets, then iterate through all
title pairs and manually review those with similarity scores above a
suitable threshold. The code I used above is very simple (and pasted below).
def n_grams(s, n, pad=False):
    # n >= 1
    # returns a list of n-grams,
    # or an empty list if n > len(s)
    if pad:
        s = '_' * (n-1) + s + '_' * (n-1)
    return [s[i:i+n] for i in range(len(s)-n+1)]

def Jaccard(tokens1, tokens2):
    # returns the exact Jaccard similarity
    # measure for two token sets
    tokens1 = set(tokens1)
    tokens2 = set(tokens2)
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)
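
As a rough sketch of the "iterate through all title pairs and review
anything above a threshold" step (the file names, the 0.7 threshold and
the surrounding loop are my assumptions, not part of the original post),
something like this would flag candidate duplicates using the two
functions above:

from itertools import combinations

def candidate_dupes(titles, n=2, threshold=0.7):
    # pair each title with its padded n-gram set, then score all pairs
    token_sets = [(title, set(n_grams(title.lower(), n, pad=True)))
                  for title in titles]
    for (t1, s1), (t2, s2) in combinations(token_sets, 2):
        score = Jaccard(s1, s2)
        if score >= threshold:
            yield score, t1, t2

# e.g. with one title per line in titles.txt (hypothetical file name)
with open('titles.txt') as infile:
    titles = [line.strip() for line in infile if line.strip()]
with open('possible_dupes.txt', 'w') as outfile:
    for score, t1, t2 in sorted(candidate_dupes(titles), reverse=True):
        outfile.write('%.2f\t%s\t%s\n' % (score, t1, t2))

That's quadratic in the number of titles, which should be fine for a
single file of movie titles; the threshold would need tuning by eye.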
Duncan