Problems with re
Tim Peters
tim_one at email.msn.com
Sun May 23 03:01:51 EDT 1999
[Berthold Höllmann, with a long regexp that "takes forever" in some cases,
and is full of surprises anyway]
[Tim]
> + Regular expressions aren't powerful enough to match nested
> brackets. So a regexp approach to this problem is at best a
> sometimes-screws-up hack, no matter how much more time you pour
> into it.
[Berthold]
> What would be your recommendation instead?
Up to you! If you can live with errors but can't live with exponential
matching time, keep the regexp as simple as the alternative I posted at the
end.
If you can't live with errors, you're going to have to parse the Python "for
real"; regexps are great for lexical classification but hopeless for real
parsing. The easiest way to do that is to use a very simple regexp to find
\CallPython sites in the LaTeX, then use Lib/tokenize.py to parse the Python
part. tokenize.py isn't particularly easy to learn how to use, but it's
bulletproof, and easier to learn than any of the general-purpose parsing
systems.
Looks like you want to find the first unmatched right brace. So your
tokeneater function can ignore everything except "{" and "}" tokens,
incrementing a depth counter by 1 when it sees the former and decrementing
when it sees the latter. If the depth counter is already 0 when it sees a
"}", you've found the closing LaTeX brace, or the Python is buggy. tokenize
will handle strings and continuation lines etc correctly without any work on
your part.
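
A minimal sketch of that depth counter, assuming a current Python where
tokenize.generate_tokens() yields tokens one at a time instead of calling a
tokeneater function. The name find_closing_brace is made up for illustration,
and the text passed in is assumed to start just after the opening LaTeX brace:

```python
import io
import tokenize


def find_closing_brace(python_text):
    """Return the offset of the first unmatched "}" in python_text, or -1."""
    # Offset at which each line starts, so a (row, col) pair maps to an index.
    line_starts = [0]
    for line in python_text.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    depth = 0
    readline = io.StringIO(python_text).readline
    try:
        for tok in tokenize.generate_tokens(readline):
            if tok.type != tokenize.OP:
                continue  # names, strings, comments etc. are ignored
            if tok.string == "{":
                depth += 1
            elif tok.string == "}":
                if depth == 0:
                    row, col = tok.start  # row is 1-based, col is 0-based
                    return line_starts[row - 1] + col
                depth -= 1
    except (tokenize.TokenError, SyntaxError):
        # The text stopped being tokenizable before an unmatched "}" showed up.
        pass
    return -1
```

Because the loop returns as soon as it sees the unmatched "}", whatever LaTeX
follows the brace is never asked for as a token. A brace inside a string
literal comes back as part of a STRING token, so it never touches the counter.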
Or write a character-at-a-time loop yourself that skips over strings and
counts braces. It's much easier to write something like that than a hairy
regexp.
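
A rough sketch of such a loop, under the assumption that only string literals
and "#" comments can hide a brace. The function name is hypothetical, and this
is a hand-rolled approximation of Python's lexical rules, not a replacement
for tokenize:

```python
def find_closing_brace_by_hand(text):
    """Return the index of the first unmatched "}" in text, or -1."""
    depth = 0
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in "'\"":
            # Triple-quoted strings use a three-character delimiter.
            quote = ch * 3 if text.startswith(ch * 3, i) else ch
            i += len(quote)
            while i < n and not text.startswith(quote, i):
                # A backslash protects the character that follows it.
                i += 2 if text[i] == "\\" else 1
            i += len(quote)
            continue
        if ch == "#":
            # Skip a comment up to (not including) the end of the line.
            while i < n and text[i] != "\n":
                i += 1
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return i
            depth -= 1
        i += 1
    return -1
```

For example, find_closing_brace_by_hand("f({'a': '}'}, x)} rest") steps over
the quoted "}" and the balanced dictionary braces and stops at the brace just
before " rest".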
even-better-it-works-ly y'rs - tim
PS: If these are *your* LaTeX constructions you're trying to parse, how
about finessing the problem out of existence by defining trivial-to-find
beginPython/endPython pairs instead?
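
With explicit markers the regexp stays trivial, since nothing has to be
counted. A sketch, assuming the markers are spelled \beginPython and
\endPython (hypothetical spellings) and that the Python code never contains
the end marker itself:

```python
import re

# Non-greedy match between the two markers; DOTALL lets "." span newlines.
PYTHON_BLOCK = re.compile(r"\\beginPython(.*?)\\endPython", re.DOTALL)


def extract_python_blocks(latex):
    """Return the text found between each marker pair, in document order."""
    return [m.group(1) for m in PYTHON_BLOCK.finditer(latex)]
```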