Regular Expressions: large amount of or's
kent37 at tds.net
Tue Mar 1 21:04:44 CET 2005
André Søreng wrote:
> Given a string, I want to find all occurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
> With the re module, this can be solved something like this:
> import re
> r = re.compile("word1|word2|word3|.......|wordN")
> Unfortunately, with more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works fine, but slowly).
> I don't know if using the re module is the right solution here. Any
> suggestions on alternative solutions or data structures that could
> be used to solve the problem?
If you can split some_string into individual words, you could look them up in a set of known words:
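# Build the set once; each membership test is then a hash lookup.
known_words = set("word1 word2 word3 ....... wordN".split())
# Keep every whitespace-separated token of some_string that is a known word.
found_words = [word for word in some_string.split() if word in known_words]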
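One caveat with a plain split() is that punctuation stays attached to the words, so "word2," would not match "word2". Here is a minimal sketch of one way around that, assuming the known words consist only of letters, digits and underscores. The function name find_known_words and the sample strings are made up for illustration; they are not from the original question.

```python
import re

def find_known_words(text, known_words):
    """Return the tokens of `text` that appear in `known_words`, in order."""
    # \w+ pulls out runs of letters, digits and underscores, so punctuation
    # attached to a word ("word2," or "word3.") does not block a match.
    tokens = re.findall(r"\w+", text)
    # Set membership is a hash lookup, so the cost per token does not grow
    # with the number of known words.
    return [token for token in tokens if token in known_words]

# Hypothetical sample data, standing in for the real word list.
known_words = set("word1 word2 word3".split())
some_string = "word2, other text; word3. word1!"
found_words = find_known_words(some_string, known_words)
print(found_words)  # ['word2', 'word3', 'word1']
```

This only finds whole single-word tokens. Entries that contain spaces, or matches inside a longer word, would need a different approach.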