[Python-Dev] Grammar help.
Guido van Rossum
guido@beopen.com
Thu, 06 Jul 2000 18:53:51 -0500
> On 07 July 2000, Thomas Wouters said:
> > Argh, my head hurts ! :P
Greg Ward replies:
> Yeah, but try writing your *own* parser-generator. >grin<
>
> > Here's what I did. The current definition of 'atom' is:
> >
> > atom: '(' [testlist] ')' | '[' [testlist] ']' | << ... >>
> > testlist: test (',' test)* [',']
> >
> > I tried adding it like this:
> >
> > atom: '(' [testlist] ')' | '[' [listmaker] ']' | << ... >>
> > testlist: test (',' test)* [',']
> > listmaker: testlist | rangemaker
> > rangemaker: [test] ':' test
>
> I think the problem is this:
> * one possible candidate for atom is: '[' listmaker
> * two candidates for listmaker: testlist or rangemaker
> * but both of these start with a test; the fact that one of them
> is optional is probably beside the point
> * the parser needs to see ',' or ':' to know if it's got a
> testlist or a rangemaker: but because a test could be any number
> of tokens, it would need infinite lookahead. (I think.) Since
> this is a recursive-descent parser, it is most likely LL(1) --
> at most one token of lookahead. Not good enough.
>
> Another, entirely hypothetical way of looking at it (based on experience
> with a completely different recursive descent parser generator
> (PCCTS/ANTLR) and dim recollections of the Dragon Book):
> * parser sees a '[' token and enters the 'atom' state
> (ie. calls a subroutine 'atom()')
> * swallows the '[' and immediately enters 'listmaker' (no other option)
> * now it wants to enter either 'testlist' or 'rangemaker': but the
> FIRST(testlist) == FIRST(test) (testlist can only start with a
> test), and FIRST(rangemaker) == FIRST(test) + ':' (rangemaker
> can start with a test or the ':' token)
> * I suspect this is the ambiguity: overlap in the FIRST sets of
> testlist and rangemaker
Greg's got it right. It's a recursive descent a.k.a. LL(1) parser.
The solution? Try to rewrite the grammar to avoid the ambiguities.
So we have [testlist | rangemaker] for listmaker, where testlist is
test(','test)* and rangemaker is [test]':'[test]. (Note that you'll
have to add [':'[test]] to support [lo:hi:step], but tht's not the
problem here.)
Rewrite this as
listmaker: [rangetail | test((','test)* | rangetail)]
rangetail: ':' [test] [':' [test]]
Now the two alternatives have different FIRST sets.
The disadvantage is that you have to change the code generator to work
withthis mess -- that's why people like LALR(1) and other parsers.
But LL(1) was all I could create myself (Python uses a homegrown
parser generator).
--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)