Extracting words from a string : *fast*

Doug Fort dougfort at downright.com
Tue Jun 19 13:57:46 EDT 2001


Thomas Weholt wrote:

> Hi,
> 
> I need to extract words from a string. This method will be used extensively
> in an indexer, so it needs to be as fast as possible.
> 
> It needs to split words by case, numbers, spaces and chars like ,.-_/\*'
> etc. It returns a list of lower-case entries of the words found, or a
> dictionary where the words are keys and the numbers of occurrences are values.
> 
> Ex.
> 
> s = 'This is a.test for ThomasWeholt - magic42'
> print getWords(s)
> -----------------------------------------------------
> ['this','is','a','test','for','thomas','weholt','magic','magic42']
> 
> The texts to be processed are mostly small in size but can also be huge,
> e.g. 1-10MB.
> 
> Thomas
> 
> 
> 
Have you seen Dr. David Mertz's article on developing a Python indexer?  He 
addresses the problem of identifying words. Check out 
http://gnosis.cx/publish/programming/charming_python_15.txt
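
For reference, here is a minimal sketch of the kind of function described in
the question. It is not taken from the article above. The names get_words,
count_words and _WORD_RE are made up for illustration, and it assumes ASCII
text and a single precompiled regular expression as the splitting strategy.
Note one deliberate difference from the sample output in the question: this
sketch splits 'magic42' into 'magic' and '42' rather than returning 'magic'
and 'magic42', since the intended rule for mixed letter/digit tokens is not
spelled out.

```python
import re
from collections import Counter

# Assumed tokenization rule (ASCII only):
#   - a run of capitals not followed by a lowercase letter (e.g. "HTTP"),
#   - an optional capital followed by lowercase letters (e.g. "Thomas"),
#   - a run of digits (e.g. "42").
# Everything else (spaces, ,.-_/\*' and so on) acts as a separator.
_WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def get_words(text):
    """Return the lower-cased words found in text, in order of appearance."""
    return [word.lower() for word in _WORD_RE.findall(text)]


def count_words(text):
    """Return a dict-like mapping of each word to its number of occurrences."""
    return Counter(get_words(text))


if __name__ == "__main__":
    s = 'This is a.test for ThomasWeholt - magic42'
    print(get_words(s))
    print(count_words(s))
```

Compiling the pattern once and letting findall do the scanning keeps the
per-character work inside the regex engine, which is usually the cheapest
option in pure Python. Whether it is fast enough for 1-10MB inputs would need
to be measured on real data.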
-- 
Doug Fort <dougfort at downright.com>
Senior Meat Manager
Downright Software LLC
http://www.downright.com
