Levenshtein word comparison - performance issue

S.Selvam Siva s.selvamsiva at gmail.com
Fri Feb 20 02:03:35 EST 2009


On Sat, Feb 14, 2009 at 3:01 PM, Peter Otten <__peter__ at web.de> wrote:

> Gabriel Genellina wrote:
>
> > On Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva
> > <s.selvamsiva at gmail.com> wrote:
> >
> >> I need some help.
> >> I tried to find the top n (e.g. 5) similar words for a given word, from
> >> a dictionary of 50,000 words.
> >> I used the python-Levenshtein module; sample code is as follows.
> >>
> >> def foo(searchword):
> >>     disdict = {}
> >>     for word in self.dictionary_words:
> >>         distance = Levenshtein.ratio(searchword, word)
> >>         disdict[word] = distance
> >>     # sort disdict by value in descending order
> >>     similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
> >>     return similarwords[:5]
> >
> > You may replace the last steps (sort + slice top 5) with heapq.nlargest -
> > at least you won't waste time sorting 49,995 irrelevant words...
> > Anyway, you should measure the time taken by the first part (Levenshtein);
> > it may be the most demanding. I think there is a C extension for this,
> > which should be much faster than pure Python calculations.
> >
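
(For illustration, a minimal sketch of the heapq.nlargest idea, assuming the
same python-Levenshtein scoring as in the snippet above; top_matches is just
an illustrative name:)

import heapq
import Levenshtein

def top_matches(searchword, words, n=5):
    # heapq.nlargest keeps only the n best-scoring words instead of
    # sorting the whole 50,000-word dictionary.
    return heapq.nlargest(n, words,
                          key=lambda w: Levenshtein.ratio(searchword, w))
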
>
> [I didn't see the original post]
>
> You can use the distance instead of the ratio and put the words into bins
> of the same length. Then, if you find enough words with a distance <= 1 in
> the bin with the same length as the search word, you can stop looking.
>
> You might be able to generalize this approach to other properties that are
> fast to calculate and guarantee a minimum distance, e.g. set(word).
>
> Peter
>
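
(A rough sketch of the length-binning idea above, assuming an absolute
edit-distance cutoff; build_bins and close_words are illustrative names:)

from collections import defaultdict
import Levenshtein

def build_bins(words):
    # Group the dictionary by word length once, up front.
    bins = defaultdict(list)
    for w in words:
        bins[len(w)].append(w)
    return bins

def close_words(searchword, bins, max_distance=1, wanted=5):
    # A word whose length differs from the search word by more than
    # max_distance cannot be within that distance, so bins are visited
    # in order of length difference and the search stops early once
    # enough candidates are found.
    hits = []
    n = len(searchword)
    for length in sorted(bins, key=lambda l: abs(l - n)):
        if abs(length - n) > max_distance:
            break
        for w in bins[length]:
            if Levenshtein.distance(searchword, w) <= max_distance:
                hits.append(w)
                if len(hits) >= wanted:
                    return hits
    return hits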


Thank you all for your responses.

[Sorry, I was away for a while.]
I used the functools and heapq modules, but that helped only a little.
Then I categorized the words by length and compare against a smaller set
(each set is 50,000/4 = 12,500 words), so it now takes about a quarter of
the time of the older method.

Further, can I use threads to achieve parallel comparison? I have only a
little knowledge of Python threading.
Will the following code achieve parallelism?

thread1 = threading.Thread(target=self.findsimilar,
                           args=("1", searchword, dic_word_set1))
thread2 = threading.Thread(target=self.findsimilar,
                           args=("2", searchword, dic_word_set2))
thread3 = threading.Thread(target=self.findsimilar,
                           args=("3", searchword, dic_word_set3))
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()
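
(One detail the snippet above leaves open is how the three threads report
their candidates back; a minimal sketch, assuming findsimilar returns a list,
that the code runs inside the same class, and that the results dict and
worker wrapper are illustrative names:)

import threading

results = {}

def worker(tag, searchword, wordset):
    # Each thread stores its candidate list under its own tag so the
    # main thread can merge everything after join().
    results[tag] = self.findsimilar(tag, searchword, wordset)

threads = [threading.Thread(target=worker, args=(tag, searchword, wordset))
           for tag, wordset in (("1", dic_word_set1),
                                ("2", dic_word_set2),
                                ("3", dic_word_set3))]
for t in threads:
    t.start()
for t in threads:
    t.join()

merged = results["1"] + results["2"] + results["3"]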

I would like to hear suggestion.
Note: The issue is that I am developing a spell checker for my local
language; it may use more than 2.5 lakh (250,000) words, so I need an
efficient way to find the list of alternative words.
-- 
Yours,
S.Selvam

