Levenshtein word comparison -performance issue
S.Selvam Siva
s.selvamsiva at gmail.com
Fri Feb 20 02:03:35 EST 2009
On Sat, Feb 14, 2009 at 3:01 PM, Peter Otten <__peter__ at web.de> wrote:
> Gabriel Genellina wrote:
>
> > En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva <
> s.selvamsiva at gmail.com>
> > escribió:
> >
> >> I need some help.
> >> I tried to find top n(eg. 5) similar words for a given word, from a
> >> dictionary of 50,000 words.
> >> I used python-levenshtein module,and sample code is as follow.
> >>
> >> def foo(searchword):
> >> disdict={}
> >> for word in self.dictionary-words:
> >> distance=Levenshtein.ratio(searchword,word)
> >> disdict[word]=distance
> >> """
> >> sort the disdict dictionary by values in descending order
> >> """
> >> similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)
> >>
> >> return similarwords[:5]
> >
> > You may replace the last steps (sort + slice top 5) by heapq.nlargest -
> at
> > least you won't waste time sorting 49995 irrelevant words...
> > Anyway you should measure the time taken by the first part (Levenshtein),
> > it may be the most demanding. I think there is a C extension for this,
> > should be much faster than pure Python calculations.
> >
>
> [I didn't see the original post]
>
> You can use the distance instead of the ratio and put the words into bins
> of
> the same length. Then if you find enough words with a distance <= 1 in the
> bin with the same length as the search word you can stop looking.
>
> You might be able to generalize this approach to other properties that are
> fast to calculate and guarantee a minimum distance, e. g. set(word).
>
> Peter
> --
> http://mail.python.org/mailman/listinfo/python-list
>
Thank you all for your response,
[sorry,I was away for a while.]
I used functools,heapq modules but that helped me a little,
then i categorized the words depending on the length and
compares with a small set(each set 50000/4=12,500),
so now its taking quarter of time as compared to older method.
Further, can i use Thread to achieve parallel comparison ?,as i have little
knowledge on python-thread.
Will the following code achive parallelism?
thread1= threading.Thread(target=self.findsimilar,
args=("1",searchword,dic-word-set1)
thread2= threading.Thread(target=self.findsimilar,
args=("2",searchword,dic-word-set1)
thread3= threading.Thread(target=self.findsimilar,
args=("3",searchword,dic-word-set1)
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()
I would like to hear suggestion.
Note:The issue is i am developing spell checker for my local languge,i may
use more than 2.5 lakh words,so i need to have a best way to find out
alternative wordlist
--
Yours,
S.Selvam
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20090220/ddf4b473/attachment-0001.html>
More information about the Python-list
mailing list