Freeze problem with Regular Expression
John Machin
sjmachin at lexicon.net
Wed Jun 25 18:29:38 EDT 2008
On Jun 26, 1:20 am, Kirk <nore... at yahoo.com> wrote:
> Hi All,
> the following regular expression matching seems to enter an infinite
> loop:
>
> ################
> import re
> text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
> re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
> #################
>
[expletives deleted]
>
> I have Python 2.5.2 on Ubuntu 8.04.
> Any ideas?
Several problems:
(1) lose the vertical bars (as advised by others)
(2) ALWAYS use a raw string for regexes; your \s and \- survive only
because Python leaves unrecognized string escapes alone, whereas
something like \b would silently turn into a backspace
(3) why are you using findall on a pattern that ends in "$"?
(4) using non-verbose regexes of that length means you haven't got a
petrol drum's hope in hell of understanding what's going on
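(5) too many nested variable-length patterns; when the overall match
fails, the engine backtracks through every way of carving up the runs
of capitals, which will take a finite (but very long) time to evaluate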
(6) as remarked by others, you haven't said what you are trying to do;
what it actually is doing doesn't look sensible (see below).
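To make point (2) concrete, here is a small sketch using \b, an escape
that Python string literals do interpret. The names plain and raw and
the sample text are made up for illustration; they are not from the
original post.

```python
import re

# Without the r prefix, Python turns \b into a backspace character (0x08)
# before the regex engine ever sees the pattern.
plain = '\bITALIA\b'
# With the r prefix, the backslash survives and the regex engine reads \b
# as a word boundary.
raw = r'\bITALIA\b'

text = 'MSX ITALIA srl'

print("%d %d" % (len(plain), len(raw)))   # 8 10
print(re.findall(plain, text))            # []
print(re.findall(raw, text))              # ['ITALIA']
```

The non-raw pattern is looking for literal backspace characters around
ITALIA, so it finds nothing.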
The following code is what you get after fixing problems 1, 2, 3 and 4:
C:\junk>type infinitere.py
import re
text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
regex0 = r"""
    [^A-Z0-9]*          # match leading space
    (
        (?:
            [0-9]*      # match nothing
            [A-Z]+      # match "MSX"
            [0-9a-z\-]* # match nothing
        )+              # match "MSX"
        \s*             # match " "
        [a-z]*          # match nothing
        \s*             # match nothing
        (?:
            [0-9]*
            [A-Z]+
            [0-9a-z\-]*
            \s*
        )*              # match "INTERNATIONAL HOLDINGS ITALIA "
    )
    ([^A-Z]*)           # match "srl (di seguito "
"""
regex1 = regex0 + "$"
for rxno, rx in enumerate([regex0, regex1]):
    mobj = re.compile(rx, re.VERBOSE).match(text)
    if mobj:
        print rxno, mobj.groups()
    else:
        print rxno, "failed"
C:\junk>infinitere.py
0 ('MSX INTERNATIONAL HOLDINGS ITALIA ', 'srl (di seguito ')
### taking a long time, interrupted
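A rough sketch of why the "$" version crawls, assuming a cut-down
stand-in for the nested group rather than the full pattern. NESTED and
time_failed_match are names invented for this illustration. The text is
a run of capitals followed by "!", so the "$" can never match and the
engine has to try every way of splitting the run between repetitions of
the group before giving up.

```python
import re
import time

# Simplified stand-in for the nested group in the pattern above:
# a variable-length group, repeated, then anchored at the end.
NESTED = re.compile(r"(?:[0-9]*[A-Z]+[0-9a-z\-]*)+$")

def time_failed_match(n):
    """Return (match result, seconds taken) for n capitals followed by '!'."""
    text = "A" * n + "!"
    start = time.time()
    result = NESTED.match(text)
    return result, time.time() - start

for n in (10, 12, 14, 16, 18):
    result, elapsed = time_failed_match(n)
    print("%2d capitals: match=%r, %.4f s" % (n, result, elapsed))
```

On the machines I would expect this to run on, each extra pair of
capitals roughly quadruples the time, which is the "finite (but very
long)" behaviour from point (5). Exact timings will vary.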
HTH,
John