[Tutor] Splitting strings into blocks

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon May 1 08:11:03 CEST 2006



On Sun, 30 Apr 2006, Daniel Watkins wrote:

> I'm currently working on a program to parse LaTeX style maths 
> expressions and provide an answer. For example, I have the expression 
> "2^\frac{1}{2}". I'm trying to work out a way to split this into it's 
> most basic blocks of LaTeX (i.e. 2^ and \frac{1}{2}) while maintaining a 
> record of the depth of the expression (i.e. (2^,0),(\frac{1}{2},1)).

Hi Daniel,

Forgive me for going off a side tangent here, but do we have to work 
directly on LaTeX?

I don't mean this to be a silly question!  What kind of problem are we 
trying to solve?  It sounds like we're trying to evaluate LaTeX equations, 
so that presents the problem of parsing LaTeX.

Diagrammically:

     LaTeX source ---- parsing --> evaluated expressions

We could talk about parsing LaTeX.  But could we have equations written in 
something easier to parse, and generate LaTeX out of that?  That is:

     some easy-to-parse equations
                |
                +-- simple parser --> LaTeX writer ---> LaTeX source
                |
                +-- simple parser --> evaluator --> evaluated expressions


If we have the freedom to do this, we may want to take this route, because 
it frees us from being so tied to LaTeX.  For example, although this will 
look silly at first, we could do something like:

     text = """<para>
        Hello world.
        <math name="expr1">
            <fraction>
                <numer>3</numer>
                <denom>6</denom>
            </fraction>
        </math>
        equals:
        <evaluated ref="expr1"/>
        Does that look ok?
     </para>"""

This is just an ad-hoc XML document I've cooked up.  The idea is that, 
given something like the above, we can:

     * write a XML-processing program to turn this into legitimate
       LaTeX source, or

     * write a program to evaluate all the math expressions
       and give us their values.

If we take this route, the parsing problem here is already solved for us 
because there are XML-parsing modules like minidom or ElementTree.

     http://docs.python.org/lib/module-xml.dom.minidom.html
     http://effbot.org/zone/element-index.htm


For example, here's a quick and dirty one that doesn't do a perfect job, 
but it's a start.  (For this example, I'm using minidom because it handles 
text nodes as uniformly as other nodes, whereas ElementTree treats them as 
weird special cases.)

##################################################################
import xml.dom.minidom

def node_to_latex(node):
     """node_to_latex: node -> string
     Converts an arbitrary node to LaTeX source."""
     if node.nodeType == node.TEXT_NODE:
         return text_to_latex(node)
     elif node.tagName == 'para':
         return para_to_latex(node)
     elif node.tagName == 'math':
         return math_to_latex(node)
     elif node.tagName == 'evaluated':
         return evaluated_to_latex(node)


def text_to_latex(node):
     """text_to_latex: node -> string
     Assuming node is a text node, returns its textual data."""
     return node.data


def para_to_latex(para_node):
     """para_to_latex: para -> string
     Converts a paragraph node to LaTeX source."""
     results = []
     for sub_node in para_node.childNodes:
         results.append(node_to_latex(sub_node))
     results.append('\n')
     return ''.join(results)


def math_to_latex(math):
     """math_to_latex: node -> string
     Converts a math node to latex source."""
     results = ['\n\\begin{math}\n']
     for sub_math_node in math.childNodes:
         results.append(math_subnode_to_latex(sub_math_node))
     results.append('\n\\end{math}\n')
     return ''.join(results)


def math_subnode_to_latex(sub_node):
     """math_subnode_to_latex: node -> string
     Converst a math subnode into LaTeX.  At the moment,
     we handle only fractions and text."""
     results = []
     if sub_node.nodeType == sub_node.TEXT_NODE:
         results.append(sub_node.data)
     elif sub_node.tagName == 'fraction':
         results.append(fraction_to_latex(sub_node))
     return ''.join(results)


def fraction_to_latex(node):
     """fraction_to_latex: fraction -> string
     Converts fractions to LaTeX."""
     results = ['\\frac']
     results.append('{')
     for child_math in (
         node.getElementsByTagName('numer')[0].childNodes):
         results.append(math_subnode_to_latex(child_math))
     results.append('}')
     results.append('{')
     for child_math in (
         node.getElementsByTagName('denom')[0].childNodes):
         results.append(math_subnode_to_latex(child_math))
     results.append('}')
     return ''.join(results)


def evaluated_to_latex(node):
     """evaluated_to_latex: evaluated -> string
     Converts evaluated to LaTeX.  Doesn't quite work right yet."""
     results = []
     results.append("EVALUATED-NOT-HANDLED-YET")
     return ''.join(results)


if __name__ == '__main__':
     text = """<para>
        Hello world.
        <math name="expr1">
            <fraction>
                <numer>3</numer>
                <denom>6</denom>
            </fraction>
        </math>
        equals:
        <evaluated ref="expr1"/>
        Does that look ok?
     </para>"""
     tree = xml.dom.minidom.parseString(text)
     print node_to_latex(tree.documentElement)
##################################################################

My apologies for the length here, but I needed all those functions.  This 
does a recursive traversal through the structure of the XML tree.  It 
handles text, paragraphs, and very simple math.

The XML document is neutral in the ways we can interpret it.  Not only can 
we use it as a source for making LaTeX documents, but we can also extract 
all the math expressions in there by doing a similar kind of recursive 
traversal.

#####################################################################
def find_math(node):
     """find_math: node -> (listof node)
     Returns a list of all the math sub-nodes embedded in the node."""
     if node.nodeType == node.TEXT_NODE:
         return []
     elif node.tagName == 'math':
         return [node]
     else:
         results = []
         for child in node.childNodes:
             results.extend(find_math(child))
         return results
#####################################################################

Evaluating these math nodes should just be a matter of doing similar kinds 
of recursive traversals on their structure.


I'm sorry I went really fast on this, but does this make sense so far?


More information about the Tutor mailing list