string parsing / regexp question
Paul McGuire
ptmcg at austin.rr.com
Wed Nov 28 14:23:44 EST 2007
On Nov 28, 11:32 am, "Ryan Krauss" <ryanli... at gmail.com> wrote:
> I need to parse the following string:
>
> $$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2
> }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
> }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
> \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$
>
> The first thing I need to do is extract the arguments to \pmatrix{ }
> on both the left and right hand sides of the equal sign, so that the
> first argument is extracted as
>
> {\it x_2}\cr 0\cr 1\cr
>
> and the second is
>
> \left({{{\it m_2}\,s^2
> }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
> }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
> \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr
>
> The trick is that there are extra curly braces inside the \pmatrix{ }
> strings and I don't know how to write a regexp that would count the
> number of open and close curly braces and make sure they match, so
> that it can find the correct ending curly brace.
>
As Tim Grove points out, writing a grammar for this expression is
really pretty simple, especially using the latest version of
pyparsing, which includes a new helper method, nestedExpr. Here is
the whole program to parse your example:
from pyparsing import *
data = r"""$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=
\pmatrix{\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it
m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it
m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$"""
PMATRIX = Literal(r"\pmatrix")
nestedBraces = nestedExpr("{","}")
# Two \pmatrix{...} groups separated by "=", wrapped in "$$" markers.
grammar = "$$" + PMATRIX + nestedBraces + "=" + \
          PMATRIX + nestedBraces + \
          "$$"
res = grammar.parseString(data)
print(res)
This prints the following:
['$$', '\\pmatrix', [['\\it', 'x_2'], '\\cr', '0\\cr', '1\\cr'], '=',
'\\pmatrix', ['\\left(', [[['\\it', 'm_2'], '\\,s^2'], '\\over',
['k']], '+1\\right)\\,', ['\\it', 'x_1'], '-', [['F'], '\\over',
['k']], '\\cr', '-', [[['\\it', 'm_2'], '\\,s^2\\,F'], '\\over',
['k']], '-F+\\left(', ['\\it', 'm_2'], '\\,s^2\\,\\left(', [[['\\it',
'm_2'], '\\,s^2'], '\\over', ['k']], '+1', '\\right)+', ['\\it',
'm_2'], '\\,s^2\\right)\\,', ['\\it', 'x_1'], '\\cr', '1\\cr'], '$$']
Okay, maybe this looks a bit messy. But believe it or not, the
returned results give you access to each grammar element as:
['$$', '\\pmatrix', [nested arg list], '=', '\\pmatrix',
[nested arg list], '$$']
Not only has the parser handled the {} nesting levels, but it has
structured the returned tokens according to that nesting. (The '{}'s
are gone now, since their delimiting function has been replaced by the
nesting hierarchy in the results.)
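For comparison, here is a rough sketch of the brace counting the original question asks about, done by hand in plain Python. It is not how pyparsing implements nestedExpr; the helper names extract_braced and pmatrix_args are made up for this illustration, the sample string is a shortened version of the original data, and backslash-escaped braces are not handled.

```python
def extract_braced(text, start):
    """Return the text inside the brace group that opens at text[start].

    Hypothetical helper: counts opening and closing braces until the
    depth returns to zero. Escaped braces are not handled.
    """
    if text[start] != "{":
        raise ValueError("expected '{' at index %d" % start)
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                # Drop the outermost pair of braces.
                return text[start + 1:i]
    raise ValueError("unbalanced braces")


def pmatrix_args(text):
    """Collect the brace-delimited argument of every pmatrix in text."""
    marker = r"\pmatrix"
    args = []
    pos = text.find(marker)
    while pos != -1:
        brace = pos + len(marker)
        arg = extract_braced(text, brace)
        args.append(arg)
        # Resume the search just past this argument's closing brace.
        pos = text.find(marker, brace + len(arg) + 2)
    return args


# Shortened stand-in for the original data (an assumption for brevity).
sample = r"$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{{{F}\over{k}}\cr 1\cr }$$"
lhs, rhs = pmatrix_args(sample)
print(lhs)
print(rhs)
```

This gives back flat strings only. The nested structure that nestedExpr returns would need extra bookkeeping on top of the depth counter.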
You could use tuple assignment to get at the individual fields:
dummy,dummy,lhs_args,dummy,dummy,rhs_args,dummy = res
Or you could access the fields in res using list indexing:
lhs_args, rhs_args = res[2],res[5]
But both of these methods will break if you decide to extend the
grammar with additional or optional fields.
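To make the fragility concrete, here is a small sketch that uses plain Python lists as a stand-in for the parsed results (an assumption for illustration, with the nested argument lists shortened). The extra leading "label" token is hypothetical; it only mimics what happens once the grammar grows a new field.

```python
# Plain lists stand in for the pyparsing results here (an assumption made
# for illustration); the nested argument lists are shortened.
res = ["$$", "\\pmatrix", ["lhs"], "=", "\\pmatrix", ["rhs"], "$$"]
lhs_args, rhs_args = res[2], res[5]
print(lhs_args, rhs_args)

# Suppose the grammar later gains a hypothetical leading "label" token.
extended = ["label"] + res

# Fixed indexes now silently pick up the wrong fields...
wrong_lhs, wrong_rhs = extended[2], extended[5]
print(wrong_lhs, wrong_rhs)

# ...and seven-way tuple assignment fails outright.
try:
    d1, d2, lhs_args2, d3, d4, rhs_args2, d5 = extended
except ValueError as exc:
    unpack_error = str(exc)
    print("unpacking failed:", unpack_error)
```

The indexing failure is the nastier of the two, since it raises no error and just hands back the wrong tokens.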
A safer approach is to give the grammar elements results names, as in
this slightly modified version of grammar:
grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
          PMATRIX + nestedBraces("rhs_args") + \
          "$$"
Now you can access the parsed fields as if the results were a dict
with keys "lhs_args" and "rhs_args", or as an object with attributes
named "lhs_args" and "rhs_args":
res = grammar.parseString(data)
# dict-style access
print(res["lhs_args"])
print(res["rhs_args"])
# attribute-style access
print(res.lhs_args)
print(res.rhs_args)
Note that the default behavior of nestedExpr is to give back a nested
list of the elements according to how the original text was nested
within braces.
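As a rough sketch of working with that nested list, the two functions below walk it recursively. They assume ordinary Python lists, so real pyparsing results would first need converting with asList(). The names to_braced and max_depth are invented for this example, and the sample list is copied from the left-hand argument in the printed output above.

```python
def to_braced(tokens):
    """Rebuild brace-delimited text from a nested list of string tokens."""
    parts = []
    for tok in tokens:
        if isinstance(tok, list):
            # A sublist corresponds to one {...} group in the source text.
            parts.append(to_braced(tok))
        else:
            parts.append(tok)
    return "{" + " ".join(parts) + "}"


def max_depth(tokens):
    """Return how deeply the lists are nested (a flat list has depth 1)."""
    inner = [max_depth(tok) for tok in tokens if isinstance(tok, list)]
    return 1 + max(inner) if inner else 1


# The left-hand argument list, copied from the printed output above.
lhs_args = [["\\it", "x_2"], "\\cr", "0\\cr", "1\\cr"]
print(to_braced(lhs_args))
print(max_depth(lhs_args))
```

Rebuilding the text this way joins tokens with single spaces, so the original whitespace and line breaks are not recovered exactly.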
If you just want the original text, add a parse action to nestedBraces
to do this for you (keepOriginalText is another pyparsing builtin).
The parse action runs at parse time, so no post-processing is needed
after the parsed results are returned:
nestedBraces.setParseAction(keepOriginalText)
grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
          PMATRIX + nestedBraces("rhs_args") + \
          "$$"
res = grammar.parseString(data)
print(res)
print(res.lhs_args)
print(res.rhs_args)
Now this program returns the original text for the nested brace
expressions:
['$$', '\\pmatrix', '{{\\it x_2}\\cr 0\\cr 1\\cr }', '=', '\\pmatrix',
'{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
\\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }', '$$']
['{{\\it x_2}\\cr 0\\cr 1\\cr }']
['{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
\\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }']
You can find more info on pyparsing at http://pyparsing.wikispaces.com.
Cheers!
-- Paul