Regular Expressions: large amount of or's

Nick Craig-Wood nick at
Wed Mar 2 07:48:16 CET 2005

André Søreng <wsoereng at> wrote:
>  Given a string, I want to find all occurrences of
>  certain predefined words in that string. Problem is, the list of
>  words that should be detected can be in the order of thousands.
>  With the re module, this can be solved something like this:
>  import re
>  r = re.compile("word1|word2|word3|.......|wordN")
>  r.findall(some_string)
>  Unfortunately, when having more than about 10 000 words in
>  the regexp, I get a regular expression runtime error when
>  trying to execute the findall function (compile works fine, but
>  slow).

I wrote a regexp optimiser for exactly this case.

E.g. a regexp for all 5-letter words starting with "re":

$ grep -c '^re' /usr/share/dict/words

$ grep '^re' /usr/share/dict/words  | ./ 5


As you can see, it's not perfect.

Find it in

Yes, it's Perl and rather kludgy, but it may give you ideas!
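
For a rough idea of how this kind of optimisation might look in Python, here is a minimal sketch. It is not a port of the Perl script; it assumes a trie-based approach that factors out common prefixes, so that the regexp engine tries each leading character once instead of walking thousands of alternatives. The names build_trie, trie_to_regex and compile_words are made up for illustration.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return trie


def trie_to_regex(node):
    """Return a regexp fragment matching every suffix stored under node."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(k for k in node if k != "")
    ]
    if not branches:
        return ""  # leaf: nothing more to match
    if len(branches) == 1 and not ends_here:
        return branches[0]  # single path, no grouping needed
    body = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, the rest is optional.
    return body + "?" if ends_here else body


def compile_words(words):
    """Compile a whole-word matcher for the given (non-empty) word list."""
    return re.compile(r"\b(?:" + trie_to_regex(build_trie(words)) + r")\b")


if __name__ == "__main__":
    words = ["read", "real", "reap", "re", "red"]
    print(trie_to_regex(build_trie(words)))
    print(compile_words(words).findall("the red reader will read a real book"))
```

Each alternation produced this way branches on distinct first characters, so the result is usually much shorter than the plain word1|word2|... form. Whether that is enough to avoid the runtime error with 10 000 words is something to test on the real list.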

Nick Craig-Wood <nick at> --
