Is there a maximum length of a regular expression in python?
Steve Holden
steve at holdenweb.com
Wed Jan 18 09:07:21 EST 2006
olekristianvillabo at gmail.com wrote:
> I have a regular expression that is approximately 100k bytes. (It is
> basically a list of all known Norwegian postal numbers and the
> corresponding place with | in between.) I know this is not the intended
> use for regular expressions, but it should nonetheless work.
>
> the pattern is
> ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305
> SVOLVÆR)'
>
> The error message I get is:
> RuntimeError: internal error in regular expression engine
>
And I'm not the least bit surprised. Your code is brittle (i.e. likely
to break) and cannot, for example, cope with multiple spaces between the
number and the word(s). Quite apart from breaking the interpreter :-)
I'd say your test was the clearest possible demonstration that there
*is* a limit.
Wouldn't it be better to have a dict keyed on the number and containing
the word (which you can construct from the same source you constructed
your horrendously long regexp)?
Then if you find something matching the pattern (untested)
ur'(N-|NO-)?((\d\d\d\d)\s*([A-Za-z ]+))'
or something like it that actually works (I invariably get regexps wrong
at least three times before I get them right) you can use the dict to
validate the number and name.
Quite apart from anything else, if the text line you are examining
doesn't have the right syntactic form then you are going to test
hundreds of options, none of which can possibly match. So matching the
syntax and then validating the data identified seems like a much more
sensible option (to me, at least).
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/