Problems with re
Tim Peters
tim_one at email.msn.com
Sat May 22 03:53:13 EDT 1999
[Berthold Höllmann]
> I have a regular expression which, in most cases, does what I want it to
> do. But on at least one string it gets into an endless loop (OK, I didn't
> wait forever). See the attached example:
>
> Python 1.5.2 (#2, Apr 22 1999, 14:34:42) [GCC egcs-2.91.66 19990314
> (egcs-1.1.2 on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import re
> >>> RE = re.escape
> >>> CP = r'\CallPython'
> >>> loopR = re.compile(
> ... "(?:" + RE(CP) + r'(\[.*\])?{(?P<CodeC>(?:' + '""".*?"""|".*?"'
> ... + "|'''.*?'''|'.*?'|" +
> ... '{.*?}+?|[^{]+?)+?))}',
> ... re.MULTILINE|re.DOTALL)
> >>>
> >>> LL = loopR.match("\CallPython{LaTeXPy.PyLaTeX(dir(math))}")
> >>> LL = loopR.match("\CallPython{LaTeXPy.PyLaTeX({1:1,2:2,3:3}")
> >>> LL = loopR.match("\CallPython{LaTeXPy.PyLaTeX(dir(math));
> LaTeXPy.PyLaTeX({1:1,2:2,3:3}")
>
> I try to parse a LaTeX file for the included "\CallPython" statements to
> extract python commands from this statement.
>
> Do you have any idea?
Oh, several -- but you're not going to like them <wink>.
+ Regular expressions aren't powerful enough to match nested brackets. So a
regexp approach to this problem is at best a sometimes-screws-up hack, no
matter how much more time you pour into it.
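For comparison, here is a minimal sketch of the non-regexp route: find the macro name, then count braces until the opening one is closed. The function name call_python_bodies and its macro parameter are made up for illustration, and the sketch assumes braces never appear unbalanced inside Python string literals in the body, which real input could violate.

```python
def call_python_bodies(text, macro="\\CallPython"):
    """Yield the brace-balanced body of each macro[...]{...} found in text."""
    pos = 0
    while True:
        start = text.find(macro, pos)
        if start < 0:
            return
        i = start + len(macro)
        if text.startswith("[", i):      # optional [...] argument: skip it
            close = text.find("]", i)
            if close < 0:
                return
            i = close + 1
        pos = i
        if not text.startswith("{", i):
            continue                     # macro name without a {...} body
        depth, j = 1, i + 1
        while j < len(text) and depth:   # scan until the opening brace closes
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth:                        # ran off the end: unbalanced braces
            return
        yield text[i + 1:j - 1]
        pos = j


sample = r"\CallPython{LaTeXPy.PyLaTeX({1:1,2:2,3:3})} and \CallPython[opt]{f({})}"
bodies = list(call_python_bodies(sample))
```

Because it keeps an explicit depth counter, the scanner handles any nesting level, and it runs in time linear in the length of the text.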
+ If you have to use regexps, at least use re.VERBOSE to make the mess more
readable; e.g.,
    loopR = re.compile(r"""
        (?: \\CallPython
            (\[.*\])?
            {
            (?P<CodeC>
                (?: \""".*?\"""
                |   ".*?"
                |   '''.*?'''
                |   '.*?'
                |   {.*?}+?
                |   [^{]+?
                )+?
            )
        )
        }
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)
This makes modification enormously easier, and makes some obscurities
obvious; e.g., by inspection, the outermost (?: ... ) serves no purpose so
can be removed.
+ You've got nested catch-almost-anything iterators, which can lead to
exponential match time. That's what your "endless loop" is all about. See
Friedl's "Mastering Regular Expressions" for details.
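To get a feel for the numbers, take a toy pattern (not the one above): (a+)+b tried against a run of a's with no b in it. Before a backtracking matcher can report failure, it may have to try every way of carving the a's into chunks for the inner a+. The sketch below only counts those carvings; count_splits is a hypothetical helper, and a real engine may prune some of this work.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways (a+)+ can carve n consecutive 'a's into non-empty chunks."""
    if n == 0:
        return 1  # nothing left to carve: exactly one way to stop
    # The first chunk takes k characters; the remainder is carved recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (5, 10, 20, 30):
    print(n, count_splits(n))
```

The count doubles with each extra character (it is 2**(n-1)), so a few dozen characters is enough to look like an endless loop.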
+ Most of the *pieces* of this regexp don't actually match what you want
them to match, except when you're lucky (which you often will be -- but
sometimes won't be). See Friedl.
+ You're better off assuming the LaTeX is correct. What would it really
hurt if your second example matched? Throw caution to the wind and try this
instead:
    loopR = re.compile(r"""
        \\CallPython
        (\[.*?\])?    # note that I made the guts a minimal match here
        {
        (?P<CodeC>
            .*?       # anything
        )
        }\s*$         # until finding a right brace at the end of a line
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)
This will screw up too, but the conditions under which it will are now
obvious <wink> and so easy to avoid; won't ever consume exponential time,
either.
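A quick way to see both the hits and the misses is to run the simplified pattern over a few lines. This is only a sketch: the latex sample text is made up for the example, and finditer is just one reasonable way to walk a whole file.

```python
import re

loopR = re.compile(r"""
    \\CallPython
    (\[.*?\])?          # optional bracketed argument, minimal match
    {
    (?P<CodeC>
        .*?             # anything
    )
    }\s*$               # a right brace at the end of a line
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

# Hypothetical LaTeX input, invented for this example.
latex = r"""
Some text.
\CallPython{LaTeXPy.PyLaTeX(dir(math))}
More text.
\CallPython{LaTeXPy.PyLaTeX({1:1,2:2,3:3})}
\CallPython{print(1)} trailing text on the same line
"""

bodies = [m.group("CodeC") for m in loopR.finditer(latex)]
for body in bodies:
    print(body)
```

The first two calls are picked up, nested braces and all, because the match only stops at a right brace that ends a line. The third is skipped because its closing brace is followed by more text. Had it come earlier in the file, the minimal match would have run on across lines to the next brace that does end a line, so keep the closing brace last on its line.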
if-a-regexp-is-longer-than-that-one-it's-wrong<0.9-wink>-ly y'rs - tim