Regular Expressions: large amount of or's
nick at craig-wood.com
Wed Mar 2 07:48:16 CET 2005
André Søreng <wsoereng at tiscali.no> wrote:
> Given a string, I want to find all occurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
> With the re module, this can be solved something like this:
> import re
> r = re.compile("word1|word2|word3|.......|wordN")
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works fine, but
I wrote a regexp optimiser for exactly this case.
E.g. a regexp for all 5-letter words starting with "re":
$ grep -c '^re' /usr/share/dict/words
$ grep '^re' /usr/share/dict/words | ./words-to-regexp.pl 5
As you can see, it's not perfect.
Find it at http://www.craig-wood.com/nick/pub/words-to-regexp.pl
Yes, it's Perl and rather kludgy, but it may give you ideas!
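For the Python side, here is a minimal sketch of the same general idea, assuming a prefix-trie approach. It is not a port of words-to-regexp.pl, and the names words_to_regexp and _to_pattern are made up for illustration. Words sharing a prefix are merged so the regexp engine has far fewer alternatives to try at each position. Whether this is enough to avoid the runtime error with a 10 000 word list would need testing.

```python
import re


def words_to_regexp(words):
    """Return regexp source text that matches exactly the given words."""
    # Build a prefix trie: each node maps a character to a child node.
    # The empty-string key marks "a word ends at this node".
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return _to_pattern(trie)


def _to_pattern(node):
    """Turn one trie node (and everything below it) into regexp text."""
    word_ends_here = "" in node
    branches = [re.escape(ch) + _to_pattern(node[ch])
                for ch in sorted(node) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not word_ends_here:
        # A single continuation needs no grouping.
        return branches[0]
    pattern = "(?:" + "|".join(branches) + ")"
    # If a word may also stop here, the continuation is optional.
    return pattern + "?" if word_ends_here else pattern


if __name__ == "__main__":
    source = words_to_regexp(["read", "ready", "real", "red"])
    print(source)
    # Anchor on word boundaries so only whole words are reported.
    r = re.compile(r"\b(?:" + source + r")\b")
    print(r.findall("ready to read the real red reed"))
```

The optional groups are greedy, so where one word is a prefix of another the longer word is tried first. All groups are non-capturing, which keeps findall returning whole matches rather than group contents.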
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick
More information about the Python-list