Regular Expressions: large amount of or's
Kent Johnson
kent37 at tds.net
Tue Mar 1 15:04:44 EST 2005
André Søreng wrote:
>
> Hi!
>
> Given a string, I want to find all occurrences of
> certain predefined words in that string. The problem is that the list of
> words to detect can number in the thousands.
>
> With the re module, this can be solved with something like:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, with more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> calling findall (compile works fine, but is slow).
>
> I don't know if using the re module is the right solution here. Are there
> any suggestions for alternative solutions or data structures that could
> be used to solve the problem?
If you can split some_string into individual words, you could look them up in a set of known words:

    known_words = set("word1 word2 word3 ....... wordN".split())
    found_words = [word for word in some_string.split() if word in known_words]

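A slightly fuller sketch of the same idea, in case it helps. The assumptions here are mine, not from the original question: the function name find_known_words and the three-word vocabulary are placeholders, the match is case-sensitive, and words are taken to be runs of \w characters so that trailing punctuation does not hide a match. Each lookup in the set takes roughly constant time, so the size of the word list should matter far less than it does with a long alternation.

```python
import re

def find_known_words(text, known_words):
    """Return the tokens of text that appear in known_words, in order."""
    # \w+ extracts runs of letters, digits and underscores, which strips
    # surrounding punctuation that a plain split() would leave attached.
    return [word for word in re.findall(r"\w+", text) if word in known_words]

# Placeholder vocabulary; the real list would hold thousands of entries.
known_words = {"spam", "eggs", "ham"}
some_string = "Spam, eggs and more spam; no ham today."

found_words = find_known_words(some_string, known_words)
print(found_words)  # ['eggs', 'spam', 'ham']  ("Spam" differs in case)
```

This only finds single-word entries. If the list contains multi-word phrases, or if matches have to be found inside longer words, a set lookup per token will not be enough.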
Kent
>
> André
>