<br><br><div class="gmail_quote">On Sat, Feb 14, 2009 at 3:01 PM, Peter Otten <span dir="ltr"><__<a href="mailto:peter__@web.de">peter__@web.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<div class="Ih2E3d">Gabriel Genellina wrote:<br>


<br>


> En Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva <<a href="mailto:s.selvamsiva@gmail.com">s.selvamsiva@gmail.com</a>><br>


> escribió:<br>


><br>


</div><div class="Ih2E3d">>> I need some help.<br>


>> I tried to find top n(eg. 5) similar words for a given word, from a<br>


>> dictionary of 50,000 words.<br>


>> I used python-levenshtein module,and sample code is as follow.<br>


>><br>


>> def foo(searchword):<br>


>>     disdict={}<br>


>>     for word in self.dictionary-words:<br>


>>                    distance=Levenshtein.ratio(searchword,word)<br>


>>                    disdict[word]=distance<br>


>>     """<br>


>>      sort the disdict dictionary by values in descending order<br>


>>     """<br>


>>     similarwords=sorted(disdict, key=disdict.__getitem__, reverse=True)<br>


>><br>


>>     return similarwords[:5]<br>


><br>


</div><div class="Ih2E3d">> You may replace the last steps (sort + slice top 5) by heapq.nlargest - at<br>


> least you won't waste time sorting 49995 irrelevant words...<br>


> Anyway you should measure the time taken by the first part (Levenshtein),<br>


> it may be the most demanding. I think there is a C extension for this,<br>


> should be much faster than pure Python calculations.<br>


><br>


<br>


</div>[I didn't see the original post]<br>


<br>


You can use the distance instead of the ratio and put the words into bins of<br>


the same length. Then if you find enough words with a distance <= 1 in the<br>


bin with the same length as the search word you can stop looking.<br>


<br>


You might be able to generalize this approach to other properties that are<br>


fast to calculate and guarantee a minimum distance, e. g. set(word).<br>


<font color="#888888"><br>


Peter<br>


</font><div><div></div><div class="Wj3C7c">--<br>


<a href="http://mail.python.org/mailman/listinfo/python-list" target="_blank">http://mail.python.org/mailman/listinfo/python-list</a><br>


</div></div></blockquote></div><br><br>Thank you all for your response,<br><br>[sorry,I was away for a while.]<br>I used functools,heapq modules but that helped me a little,<br>then i categorized the words depending on the length and <br>


compares with a small set(each set 50000/4=12,500),<br>so now its taking quarter of time as compared to older method.<br><br>Further, can i use Thread to achieve parallel comparison ?,as i have little knowledge on python-thread.<br>


Will the following code achive parallelism?<br><br>thread1= threading.Thread(target=self.findsimilar, args=("1",searchword,dic-word-set1)<br>                   thread2= threading.Thread(target=self.findsimilar, args=("2",searchword,dic-word-set1)<br>


                   thread3= threading.Thread(target=self.findsimilar, args=("3",searchword,dic-word-set1)<br>                   thread1.start()<br>                   thread2.start()<br>                   thread3.start()<br>


                   thread1.join()<br>                   thread2.join()<br>                   thread3.join()<br><br>I would like to hear suggestion.<br clear="all">Note:The issue is i am developing spell checker for my local languge,i may use more than 2.5 lakh words,so i need to have a best way to find out alternative wordlist<br>


-- <br>Yours,<br>S.Selvam<br>