Is pyparsing really a recursive descent parser?

Sat Nov 3 22:07:24 EDT 2007

On 2007-11-04, Just Another Victim of the Ambient Morality
<ihatespam at hotmail.com> wrote:
>
> "Neil Cerutti" <horpner at yahoo.com> wrote in message 
> news:oz3Xi.39812$G23.23308 at newsreading01.news.tds.net...
>> On 2007-11-03, Paul McGuire <ptmcg at austin.rr.com> wrote:
>>> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
>>><ihates... at hotmail.com> wrote:
>>>>     It has recursion in it but that's not sufficient to call it a 
>>>> recursive
>>>> descent parser any more than having a recursive implementation of the
>>>> factorial function.  The important part is what it recurses through...
>>>
>>><snip>
>>>
>>>> In my opinion, the rule set I mentioned in my original post:
>>>>
>>>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>>>
>>>>     ...should be translated (internally) to something like this:
>>>>
>>>> word = Word(alphas)
>>>> grammar = Forward()
>>>> grammar << ((word + grammar) | (word + Literal(end)))
>>>>
>>>>     This allows a recursive search to find the string correct without 
>>>> any
>>>> need for "backtracking," if I understand what you mean by that. 
>>>> Couldn't
>>>> pyparsing have done something like this?
>>>>
>>>
>>> Dear JAVotAM -
>>>
>>> This was a great post!  (Although I'm not sure the comment in the
>>> first paragraph was entirely fair - I know the difference between
>>> recursive string parsing and recursive multiplication.)
>>>
>>> You really got me thinking more about how this recursion actually
>>> behaves, especially with respect to elements such as OneOrMore.  Your
>>> original question is really quite common, and up until now, my stock
>>> answer has been to use negative lookahead.  The expression you posted
>>> is the first time I've seen this solution, and it *does* work.
>>
>> Is there not an ambiguity in the grammar?
>>
>> In EBNF:
>>
>>  goal --> WORD { WORD } END
>>
>>  WORD is '[a-zA-Z]+'
>>  END is 'end'
>>
>> I think it is fine that PyParsing can't guess what the composer
>> of that grammar meant.
>
> First, I don't know if that constitutes an ambiguity in the
> grammar. 'end' is a word but it's unambiguous that this grammar
> must end in a literal 'end'.  You could interpret the input as
> just a sequence of words or you could interpret it as a
> sequence of words terminated by the word 'end'.  

Yeah. If it were a regex, it would be: '[ab]+b'. That is a fine
regex, because a regex is generally just a recognizer; the
ambiguity doesn't have to do with recognizing the language.  But
PyParsing is parser. The ambiguity is in deciding what's a
Word(alphas) and what's an 'end'.

> One interpretation conforms to the grammar while the other
> doesn't. You would assume that the interpretation that agrees
> with the grammar would be the preferable choice and so should
> the program. Secondly, even if it is an ambiguity... so what?
> pyparsing's current behaviour is to return a parse error,
> pretending that the string can't be parsed.  Ideally, perhaps
> it should alert you to the ambiguity but, surely, it's better
> to return _a_ valid parsing than to pretend that the string
> can't be parsed at all...

I wouldn't characterize it as pretending. How would you parse:

  hello end hello end

"WORD END WORD END" and "WORD WORD WORD END" are both valid
interpretations, according to the grammar.

As soon as you remove the ambiguity from the grammar, PyParsing
starts to work correctly.

Consider writing a recursive decent parser by hand to parse the
language '[ab]+b'.

  goal --> ab_list 'b'
  ab_list --> 'a' list_tail
  ab_list --> 'b' list_tail
  list_tail --> 'a' list_tail
  list_tail --> 'b' list_tail
  list_tail --> null

The above has the exact same bug (and probably some others--I'm
sorry unable to test it just now) as the PyParsing solution.

The error is in the grammar. It might be fixed by specifying that
'b' must be followed by EOF, and then it could be coded by using
more than one character of lookahead.

class ParseError(Exception):
    pass

class Parser:
  def __init__(self, s):
    self.s = s
    self.i = 0
    self.c = self.s[self.i]

  def match(self, c):
    if self.c != c:
      raise ParseError('expected %s got %s' % (c, self.c))
    self.i += 1
    if self.i == len(self.s):
      raise ParseError('unexpected EOF')
    self.c = self.s[self.i]

  def goal(self):
    self.ab_list()
    self.match('b')

  def ab_list(self):
    if self.c in 'ab':
      self.match(self.c)
      self.list_tail()
    else:
      raise ParseError('expected list of a or b, got %s' % self.c)

  def list_tail(self):
    if self.c in 'ab':
      self.match(self.c)
      self.list_tail()
    else:
      raise ParseError('expected a list of a or b, got %s' % self.c)

>>> Parser('aaab').goal()
Traceback (most recent call last):
  File "py.py", line 37, in ?
    Parser('aaab').goal()
  File "py.py", line 20, in goal
    self.ab_list()
  File "py.py", line 26, in ab_list
    self.list_tail()
  File "py.py", line 33, in list_tail
    self.list_tail()
  File "py.py", line 33, in list_tail
    self.list_tail()
  File "py.py", line 32, in list_tail
    self.match(self.c)
  File "py.py", line 16, in match
    raise ParseError('unexpected EOF')
__main__.ParseError: unexpected EOF

It's not a coincidence that is has the same bug as the PyParsing
solution. You can remove the ambiguity in several ways, perhaps
by specifying where EOF should be (you seem to be assuming this
in your interpretation of the grammar, but PyParsing does not).

  goal --> ab_list 'b' EOF
  ab_list --> 'a' list_tail
  ab_list --> 'b' list_tail
  list_tail --> 'a' list_tail
  list_tail --> 'b' list_tail
  list_tail --> null

I need to change goal, match and list_tail for this new grammar,
and add an EOF object and a peek method.

...
EOF = object()
...
  def match(self, c):
    if self.c != c:
      raise ParseError('expected %s got %s' % (c, self.c))
    if self.i >= len(self.s)-1:
        self.c = EOF
        self.i = len(self.s)
    else:
        self.i += 1
        self.c = self.s[self.i]

  def peek(self, lookahead):
      if self.i+lookahead >= len(self.s):
          return EOF
      else:
          return self.s[self.i + lookahead]
  ...
  def list_tail(self):
    if self.c == 'a':
        self.match('a')
        self.list_tail()
    elif self.c == 'b':
        if self.peek(1) != EOF:
            self.match('b')
            self.list_tail()
    else:
      raise ParseError('expected list of a or b, got %s' % self.c)

The grammar now has the problem that it's LL(2) rather than
LL(1). PyParsing can handle that type of grammar with negative
lookahead assertions, I think.

-- 
Neil Cerutti