Is pyparsing really a recursive descent parser?

Sat Nov 3 18:40:39 EDT 2007

"Paul McGuire" <ptmcg at austin.rr.com> wrote in message 
news:1194104971.263256.66640 at d55g2000hsg.googlegroups.com...
> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
> <ihates... at hotmail.com> wrote:
>>     It has recursion in it but that's not sufficient to call it a 
>> recursive
>> descent parser any more than having a recursive implementation of the
>> factorial function.  The important part is what it recurses through...
>
> <snip>
>
>> In my opinion, the rule set I mentioned in my original post:
>>
>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>
>>     ...should be translated (internally) to something like this:
>>
>> word = Word(alphas)
>> grammar = Forward()
>> grammar << ((word + grammar) | (word + Literal(end)))
>>
>>     This allows a recursive search to find the string correct without any
>> need for "backtracking," if I understand what you mean by that.  Couldn't
>> pyparsing have done something like this?
>>
>
> Dear JAVotAM -
>
> This was a great post!  (Although I'm not sure the comment in the
> first paragraph was entirely fair - I know the difference between
> recursive string parsing and recursive multiplication.)

    I often use hyperbole to emphasize a point.  I honestly mean no offense. 
That comment wasn't even meant to be fair, I was hoping to be provocative...

> You really got me thinking more about how this recursion actually
> behaves, especially with respect to elements such as OneOrMore.  Your
> original question is really quite common, and up until now, my stock
> answer has been to use negative lookahead.  The expression you posted
> is the first time I've seen this solution, and it *does* work.

    I'm glad I got you thinking!  I'd hate to be another newbie with another 
of a thousand questions answered in the FAQ....
    Hey, are you the author of pyparsing?

> I was all set to write a cocky response on why your expression
> wouldn't work.  I've seen it many times before, where people (usually
> coming from EBNF experience) implement listOfXs = OneOrMore(x) as:
>
> listOfXs = Forward()
> listOfXs << ( x + listOfXs | x )
>
> Actually, what they usually write is:
>
> listOfXs << ( listOfXs + x )
>
> but this sends pyparsing into a recursive tailspin.
>
> So I fired up SciTE and copy/pasted your code into the editor and ran
> it, and it worked just fine - this was a shock!  I studied this for a
> few minutes, and then realized what was happening.  First of all, I
> misread what you posted.  You posted this:
>
> grammar << ((word + grammar) | (word + Literal(end)))
>
> which works.  I *thought* you posted this:
>
> grammar << ((word + grammar) | word) + Literal(end)
>
> which doesn't work.  In fact this behaves the same as your original
> post, except it iterates over the input string's words recursively,
> vs. repetition ins a for-loop, as is done by OneOrMore.

    I'm grateful that you actually tested my code before posting your cocky 
response!

> So going back to your working version, I had to see why it works.
> Initially, the first term in the MatchFirst (the '|' operator creates
> MatchFirst instances) is just the same, and by grammar referencing
> itself, it just goes word by word through the input trying to find a
> match.  I'll try to visualize the steps:
>
> level    "First Second Third end"
> 1         word  grammar
> 2               word   grammar
> 3                      word  grammar
> 4                            word grammar <- fails!
> 4                            word end <- fails!
>                             (therefore level 3 grammar fails!)
> 3                      word  end    <-- success!!!
>
> grammar has 2 options: to match a word followed by a grammar, or to
> match a word followed by 'end'.  At 4 levels deep into the Forward's
> recursion, the first option fails, because after parsing "end" as the
> initial word, there is no more text to try to match against grammar.
> Level 4's Forward then also tries to match a word followed by 'end',
> but this fails for the same reason.  So at this point, the 4th level
> Forward fails to match either of its options, so it throws its
> exception back up to level 3, indicating that the first alternative,
> word followed by grammar, failed.  Level 3 then moves on to see if
> word followed by the literal 'end' matches, and it does - success!

    This is, literally, what it's doing.  I'm not exactly a programming whiz 
so I think of it a little more abstractly.  In pyparsing's implementation, 
it recurses through the first rule, OneOrMore, then iteratively moves to the 
next rule, only to fail.  So, obviously that rule must be part of the 
recursion if we're not to use backtracking or lookaheads.  If it's part of 
the recursion, it has to be the terminal case.  We know that the OneOrMore 
rule is necessarily followed by the literal, so we can safely conclude that 
the terminal case will necessarily be followed by the literal, hence the 
production mentioned...

> Here's where I am stuck now.  In the original grammar that you posted,
> which you want to render into this behavior, grammar is defined as:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
> Unfortunately, when the OneOrMore gets constructed, it does not have
> any visibility beyond knowing what is to be repeated.  Again, here is
> the data structure that is being built:
>
> - And
>  - OneOrMore
>    - Word(alphas)
>  - Literal('end')
>
> Only at the level of the And is there any awareness that the OneOrMore
> is followed by anything, let alone by something which could be
> misinterpreted as something matching the OneOrMore's repetition
> expression.
>
> Can you suggest a way I could generalize this, so that OneOrMore stops
> matching before it gets to 'end'?

    Again, I'm not an elite hacker so it's hard to give a good suggestion. 
I might have implemented the parser by having its input be a string 
specifying the grammar but, then again, I'd need a parser to prase that 
string!  Oh, the irony...
    Probably, I would have constructed the rule tree differently.  I like 
how pyparsing uses the Python interpreter to parse the input string, the 
Python code, itself, as pyparsing's input.  That's worth preserving...
    So, instead of organizing the tree by rule categories, as pyparsing 
does:

-And
    -OneOrMore
        -Word(alphas)
    -Literal('end')

    ...I would have organized it by what the grammar actually expects.  It's 
hard for me to describe what I'm thinking as a simple hierarchy.  I could 
use a directed graph but that's hard to type, so I'll just use "labels":

start:
    call OneOrMore:

OneOrMore:
    Or(
        And(
            Word(alphas),
            call OneOrMore:),
        And(
            Word(alphas),
            Literal('end')))

    ...hopefully, that indentation makes some sort of sense.  I would 
imagine that it would be constructed like this:

# This translates to temp << ((word + temp) | word)
temp = OneOrMore(word)

# Instead of wrapping this in an And structure,
# tell temp to And this with its terminal case...
grammar = temp + literal

    In my parser, everything would be recursive, hence the term "recursive 
descent" parser.  So, a list of Ands like:

grammar = Literal('a') + Literal('b') + Literal('c')

    ...would look like this:

start:
    call And_a:

And_a:
    Literal('a') + call And_b:

And_b:
    Literal('b') + call And_c:

And_c:
    Literal('c')

    ...notice the utter lack of iteration.
    Now, this example isn't, strictly speaking, recursive, but it descends 
through a set of function calls that are, potentionally, recursive...
    Now, obviously things like Or will have iterate through its cases but 
that's unavoidable and should be okay.  The other constructions should need 
some thought, too.  However, hopefully, this technique will render some 
constructions unnecessary...
    So, in my parser, the grammar is represented by a directed graph (the 
temporary variable flowing through operations like a + b + c + d) and all 
operations modify this graph.  Obviously, some research needs to be done on 
how feasible this implementation is, especially with the rich set of 
operations available in pyparsing, but I'm hopeful...
    Hopefully this all makes sense.  Thanks for reading...