Regular Expressions: large amount of or's

Kent Johnson kent37 at tds.net
Tue Mar 1 21:04:44 CET 2005


André Søreng wrote:
> 
> Hi!
> 
> Given a string, I want to find all occurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
> 
> With the re module, this can be solved something like this:
> 
> import re
> 
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
> 
> Unfortunately, with more than about 10,000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works, but slowly).
> 
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?

If you can split some_string into individual words, you could look them up in a set of known words:

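# Build the lookup set once; each membership test is then a hash lookup
known_words = set("word1 word2 word3 ....... wordN".split())
# Keep only the whitespace-separated tokens that are known words
found_words = [word for word in some_string.split() if word in known_words]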
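
One caveat with plain split() is that punctuation stays attached to the tokens, so "word1," would not match "word1". Here is a rough sketch of the same set-lookup idea that tokenizes with a regular expression instead. It assumes that a "word" is a run of \w characters and that matching is case-sensitive; the function name find_known_words and the sample data are made up for illustration.

```python
import re

def find_known_words(text, known_words):
    """Return the tokens of text that appear in known_words, in order."""
    # Assumption: a word is a maximal run of word characters (\w+),
    # so surrounding punctuation does not prevent a match.
    tokens = re.findall(r"\w+", text)
    # Set membership is a hash lookup, so the cost depends on the
    # length of the text rather than on the number of known words.
    return [tok for tok in tokens if tok in known_words]

# Hypothetical sample data
known_words = set(["word1", "word2", "word3"])
some_string = "word1, other stuff; word3. word1 again"
found_words = find_known_words(some_string, known_words)
print(found_words)  # ['word1', 'word3', 'word1']
```

This only finds single tokens. If the predefined "words" can contain spaces or punctuation, a per-token lookup like this will miss them and a different approach would be needed.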

Kent

> 
> André
> 


