[Tutor] newbie evalpostfix [writing a infix parser in Python]

Mon Dec 15 04:22:28 EST 2003

On Fri, 12 Dec 2003, j hansen wrote:

> You raise some good points.  I need the evalpostfix (from ch.18) to
> evaluate "5*4+300/(5-2)*(6+4)+4".  So far, I'm able to tokenize this
> expression and to convert it to a polish post order format.
> Unfortunately, the evalpostfix does not operate correctly because it
> uses "*" before "/".

Hello!

(Please keep tutor at python.org in CC; this makes sure that someone can
continue the discussion, even if I get negligent in checking my email...)

Both multiplication and division should have the same order of precedence.
Something like:

    "5*4+300/(5-2)*(6+4)+4"

should be transformed into:

    [+
        [+ [* 5 4]
            [* [/ 300
                  [- 5 2]]
               [+ 6 4]]]
        4]

in "prefix" notation.  I'm just using prefix notation because I'm more
used to it, but going from this "prefix" notation to "postfix" notation
just involves a little reordering of things.

>  Thus, it does not evaluate according to the order of precedence.

Can you show us how your program translates the original string?

Writing an infix parser is not exactly a trivial thing to do.  It's
actually a classic exercise for people who are learning how to build
programming language compilers!  But I'll sketch some of the steps that a
person would use to write this parser from scratch.

Actually, the easy approach would be to not try to write things from
scratch, but to use one of the automatic "parser-generator" modules.  For
example:

    http://theory.stanford.edu/~amitp/Yapps/

provides a tools called Yapps that automatically generates a program,
given a specification of the "grammar".  And Amit Patel already has a
parser for arithmetic expressions:

    http://theory.stanford.edu/~amitp/Yapps/expr.py

so you may just want to swipe his code.  *grin*

One of the first steps for parsing a string is to design a grammar for the
language that interests us.  For stuff like arithmetic expressions, we can
design a "grammar"  for a language of arithmetic.  It might look something
like this:

"""
    expression ::=  factor
                  | factor '+' factor
                  | factor '-' factor

    factor     ::=  term
                  | term '*' term
                  | term '/' term

    term       ::= NUMBER
                  | '(' expression ')'
"""

This grammar is still non-executable at the moment, but let's use it for a
moment.  In a "grammar", we define the names of "rules" on the left side
of the '::=' symbol, and their expansions on the right side.  In the above
grammar, we say that an 'expression' can be a single factor, or it can be
expanded out as two factors that are being added or subtracted.  For
example:

    5 * 6 + 7

can be seen as two 'factors' being added together.  The first factor
itself is a product of two 'terms'.  Let me draw it more graphically:

              expression
              /    |   \
             /     |    \
          factor  '+'   factor
          / | \            |
         /  |  \          term
     term  '*' term        |
      |         |       NUMBER(7)
  NUMBER(5)   NUMBER(6)

It's not obvious at first, but these rules are designed in a way so that
the 'precedence' of an expression is built in --- multiplication binds
more tightly because of the relationship between 'terms' and 'factors', so
something like '5 + 6 * 7' will have a parse tree that reflects the
precedences that we'd expect from multiplication and addition.

Another way to represent a parse tree like the above is with nested lists.
Something like:

    ['+', ['*', 5, 6], 7]

captures the structure of the parse tree above, if we assume that the
operator is the first element of each list, and the operands are the rest
of the list.

If we can take a string and break it down into a nested list structure
like this, then it becomes easier to generate a 'postfix' representation,
because all that's involved is a bit of list manipulation, although you
will probably need to write something recursive.

Anyway, for convenience, here's some very rough Python code that can
translate arithmetic strings into parsed expression lists.

###
"""A small parser for arithmetic expressions.  Demonstration of
recursive descent parsing.  Transforms expressions into prefix form,
using nested lists.  For example, we expect something like:

   "5*4+300/(5-2)*(6+4)+4"

to be parsed into the nested list:

['+', ['+', ['*', 5, 4], ['*', ['/', 300, ['-', 5, 2]], ['+', 6, 4]]], 4]

Each nested list represents a subexpression, and has three components:

    [operator, left_subexpression, right_subexpression]

For reference, the grammar that we parse is:

    expression ::=  factor
                  | factor '+' factor
                  | factor '-' factor

    factor     ::=  term
                  | term '*' term
                  | term '/' term

    term       ::= NUMBER
                  | '(' expression ')'

(We use a small trick with 'while' loops to avoid doing explicit
recursion.)
"""

"""Here are the symbolic tokens that we'll process:"""
PLUS = '+'
MINUS = '-'
TIMES = '*'
DIVIDE = '/'
LPAREN = '('
RPAREN = ')'
EOF = 'EOF'

def main():
    """Some test code."""
    ## test of parsing: "5*4+300/(5-2)*(6+4)+4"
    print ArithmeticParser([
        5, TIMES, 4, PLUS, 300, DIVIDE,
        LPAREN, 5, MINUS, 2, RPAREN,
        TIMES, LPAREN, 6, PLUS, 4, RPAREN,
        PLUS, 4, EOF]).parse_expression()

class ArithmeticParser:
    """Parses a token list and returns a nested expression structure in
    prefix form."""
    def __init__(self, tokens):
        self.tokens = tokens

    def peek_token(self):
        """Looks at the next token in our token stream."""
        return self.tokens[0]

    def consume_token(self):
        """Pops off the next token in our token stream."""
        next_token = self.tokens[0]
        del self.tokens[0]
        return next_token

    def parse_expression(self):
        """Returns a parsed expression.  An expression may consist of a
        bunch of chained factors."""
        expression = self.parse_factor()
        while True:
            if self.peek_token() in (PLUS, MINUS):
                operation = self.consume_token()
                factor = self.parse_factor()
                expression= [operation, expression, factor]
            else: break
        return expression

    def parse_factor(self):
        """Returns a parsed factor.   A factor may consist of a bunch of
        chained terms."""
        factor = self.parse_term()
        while True:
            if self.peek_token() in (TIMES, DIVIDE):
                operation = self.consume_token()
                term = self.parse_term()
                factor = [operation, factor, term]
            else: break
        return factor

    def parse_term(self):
        """Returns a parsed term.  A term may consist of a number, or a
        parenthesized expression."""
        if self.peek_token() != LPAREN:
            return int(self.consume_token())
        else:
            self.consume_token() ## eat up the lparen
            expression = self.parse_expression()
            self.consume_token() ## eat up the rparen
            return expression

if __name__ == '__main__':
    main()
###

The main() function shows a really minimal test of the system --- it
parses a list of tokens that represents the string
"5*4+300/(5-2)*(6+4)+4".

I took a lot of shortcuts here --- I'm not doing any error trapping here,
and I'm sure there's still some bugs in there somewhere.  Since this code
is only meant to be a minimal example of a parser, I hope you don't mind.
*grin* Does this help?

Good luck to you!