[Python-Dev] Grammar help.

Thu, 06 Jul 2000 18:53:51 -0500

> On 07 July 2000, Thomas Wouters said:
> > Argh, my head hurts ! :P

Greg Ward replies:

> Yeah, but try writing your *own* parser-generator.  >grin<
> 
> > Here's what I did. The current definition of 'atom' is:
> > 
> > atom: '(' [testlist] ')' | '[' [testlist] ']' | << ... >>
> > testlist: test (',' test)* [',']
> > 
> > I tried adding it like this:
> > 
> > atom: '(' [testlist] ')' | '[' [listmaker] ']' | << ... >>
> > testlist: test (',' test)* [',']
> > listmaker: testlist | rangemaker
> > rangemaker: [test] ':' test
> 
> I think the problem is this:
>   * one possible candidate for atom is: '[' listmaker
>   * two candidates for listmaker: testlist or rangemaker
>   * but both of these start with a test; the fact that one of them
>     is optional is probably beside the point
>   * the parser needs to see ',' or ':' to know if it's got a
>     testlist or a rangemaker: but because a test could be any number
>     of tokens, it would need infinite lookahead.  (I think.)  Since
>     this is a recursive-descent parser, it is most likely LL(1) --
>     at most one token of lookahead.  Not good enough.
> 
> Another, entirely hypothetical way of looking at it (based on experience
> with a completely different recursive descent parser generator
> (PCCTS/ANTLR) and dim recollections of the Dragon Book):
>   * parser sees a '[' token and enters the 'atom' state
>     (ie. calls a subroutine 'atom()')
>   * swallows the '[' and immediately enters 'listmaker' (no other option)
>   * now it wants to enter either 'testlist' or 'rangemaker': but the
>     FIRST(testlist) == FIRST(test) (testlist can only start with a
>     test), and FIRST(rangemaker) == FIRST(test) + ':' (rangemaker
>     can start with a test or the ':' token)
>   * I suspect this is the ambiguity: overlap in the FIRST sets of
>     testlist and rangemaker

Greg's got it right.  It's a recursive descent a.k.a. LL(1) parser.
The solution?  Try to rewrite the grammar to avoid the ambiguities.

So we have [testlist | rangemaker] for listmaker, where testlist is
test(','test)* and rangemaker is [test]':'[test].  (Note that you'll
have to add [':'[test]] to support [lo:hi:step], but tht's not the
problem here.)

Rewrite this as

listmaker: [rangetail | test((','test)* | rangetail)]
rangetail: ':' [test] [':' [test]]

Now the two alternatives have different FIRST sets.

The disadvantage is that you have to change the code generator to work
withthis mess -- that's why people like LALR(1) and other parsers.
But LL(1) was all I could create myself (Python uses a homegrown
parser generator).

--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)