Extracting words from a string : *fast*

Thomas Weholt thomas at gatsoft.no
Tue Jun 19 05:25:16 EDT 2001


Hi,

I need to extract words from a string. This method will be used extensively
in an indexer, so it needs to be as fast as possible.

It needs to split words by case, numbers, spaces and chars like ,.-_/\*'
etc. It should return either a list of the words found, in lower case, or a
dictionary where the words are keys and the number of occurrences are values.

Ex.

s = 'This is a.test for ThomasWeholt - magic42'
print(getWords(s))
-----------------------------------------------------
['this','is','a','test','for','thomas','weholt','magic','magic42']
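
To make the rules above concrete, here is a rough sketch of one way the
example output could be produced with a regular expression. It is only an
illustration under some assumptions of mine: Python 3, ASCII-only text, and
that a word with trailing digits should give both the bare word and the word
with its digits (as 'magic42' does above). countWords is a hypothetical name
for the dictionary variant, and nothing here has been benchmarked.

```python
import re
from collections import Counter

# One "piece" is either:
#   - an optional capital followed by lowercase letters and optional digits,
#   - a run of capitals (not followed by a lowercase letter) and optional digits,
#   - or a run of digits.
# Anything else (spaces, punctuation, underscores) acts as a separator.
_PIECE = re.compile(r"[A-Z]?[a-z]+[0-9]*|[A-Z]+(?![a-z])[0-9]*|[0-9]+")

# Matches a piece made of letters followed by digits, e.g. "magic42".
_LETTERS_THEN_DIGITS = re.compile(r"([A-Za-z]+)[0-9]+")


def getWords(text):
    """Return the lower-cased words found in text, in order of appearance."""
    words = []
    for piece in _PIECE.findall(text):
        m = _LETTERS_THEN_DIGITS.fullmatch(piece)
        if m:
            # Assumption: also emit the letters-only part, as in the example.
            words.append(m.group(1).lower())
        words.append(piece.lower())
    return words


def countWords(text):
    """Hypothetical helper: map each word to its number of occurrences."""
    return Counter(getWords(text))


s = 'This is a.test for ThomasWeholt - magic42'
print(getWords(s))
```

Compiling the patterns once at module level avoids recompiling them on every
call, which may matter when the function runs many times inside an indexer.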

The texts to be processed are mostly small but can also be huge, e.g.
1-10MB.

Thomas




