Regular Expressions: large amount of or's

James Stroud jstroud at mbi.ucla.edu
Tue Mar 1 15:06:39 EST 2005


This does not sound like a job for a single regex.

Using a list and a list comprehension (say your words are in a list called 
"mywordlist"), you can make this quite terse. Of course, I have a way of 
writing algorithms whose exp turns out to be very large once people work out 
the O(N^exp) for me.

Try this:


import re

myregexlist = [re.compile(aword) for aword in mywordlist]
myoccurrences = [argx.findall(some_string) for argx in myregexlist]


Now myoccurrences lines up 1:1 with mywordlist: each entry is the list of 
matches for the word at the same index. Of course you can fill mywordlist 
with real regular expressions instead of just words. If you only want to 
count the words, you may just want to use the string count method:


myoccurrences = [some_string.count(aword) for aword in mywordlist]


This may make more sense if you are not using true regexes.
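
To make the two approaches above concrete, here is a minimal, self-contained 
sketch. The sample words and string are made up for illustration, and the use 
of re.escape is my assumption for the case where the words are meant literally 
rather than as regexes; drop it if mywordlist holds real patterns.

```python
import re

# Hypothetical sample data, not from the original question.
mywordlist = ["spam", "eggs", "a.b"]
some_string = "spam and eggs, spam again; a.b but not axb"

# One compiled pattern per word; re.escape makes "." match a literal dot.
myregexlist = [re.compile(re.escape(aword)) for aword in mywordlist]
myoccurrences = [argx.findall(some_string) for argx in myregexlist]

# Pair each word with its matches to make the 1:1 mapping explicit.
matches_by_word = dict(zip(mywordlist, myoccurrences))

# Plain substring counts, no regex involved.
counts_by_word = {aword: some_string.count(aword) for aword in mywordlist}

print(matches_by_word)
print(counts_by_word)
```

Both versions match substrings, so "spam" would also be found inside 
"spammer". If whole words matter, wrapping each escaped word in r"\b" on both 
sides is one option, at the cost of going back to regexes for the counting 
case.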

James

On Tuesday 01 March 2005 11:46 am, André Søreng wrote:
> Hi!
>
> Given a string, I want to find all occurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works fine, but slow).
>
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?
>
> André

-- 
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095


