Levenshtein word comparison -performance issue
Terry Reedy
tjreedy at udel.edu
Fri Feb 13 13:22:45 EST 2009
Gabriel Genellina wrote:
> En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva
> <s.selvamsiva at gmail.com> escribió:
>
>> I need some help.
>> I tried to find top n(eg. 5) similar words for a given word, from a
>> dictionary of 50,000 words.
>> I used python-levenshtein module,and sample code is as follow.
>>
>> def foo(searchword):
>> disdict={}
>> for word in self.dictionary-words:
>> distance=Levenshtein.ratio(searchword,word)
>> disdict[word]=distance
>> """
>> sort the disdict dictionary by values in descending order
>> """
>> similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)
>>
>> return similarwords[:5]
>
> You may replace the last steps (sort + slice top 5) by heapq.nlargest -
> at least you won't waste time sorting 49995 irrelevant words...
There is also no need to build the 50000 entry word-distance dictionary.
import heapq, functools
def foo(searchword, n):
distance = functools.partial(Levenshtein.ratio, searchword)
return heapq.nlargest(n, words, distance)
If the distances are wanted along with the similar words, I strongly
suspect that it would be faster to recalculate a small number than to
generate the dict of 50000 pairs.
> Anyway you should measure the time taken by the first part
> (Levenshtein), it may be the most demanding. I think there is a C
> extension for this, should be much faster than pure Python calculations.
And such could be dropped into the code above.
Terry Jan Reedy
More information about the Python-list
mailing list