On Mon, Apr 6, 2020 at 5:18 AM Jeff Allen <ja.py@farowl.co.uk> wrote:

The PEP gives a good exposition of the problem and proposed solution, thanks.

If I understand correctly, the proposal is that the PEG grammar should become the definitive grammar for Python at some point, probably for Python 3.10, so it may evolve without the LL(1) restrictions. I'd like to raise some points with respect to that, which perhaps the migration section could answer.

Thanks, you definitely have a point here.

Once definitive, the grammar would no longer be just for CPython; it would also appear as user documentation of the language. Whether that change leaves Python with a more useful (readable) grammar seems an important test of the idea. I'm looking at https://github.com/we-like-parsers/cpython/blob/pegen/Grammar/python.gram , assuming that is indicative of a future definitive grammar. That assumption may be wrong, because as it stands it has these issues in my view:

1. It is decorated with actions in C. If a decorated grammar is offered as definitive, one with Python actions (operations on the AST) would be preferable, as it is implementation-neutral, although still hostage to AST changes that are not language changes. Maybe a grammar stripped of actions entirely is best.

Yes, the plan is to strip actions and a few other embellishments (types, names, cuts, and probably also lookaheads -- although the latter may be significant, we only use them for optimization). The parser generator (https://github.com/we-like-parsers/cpython/tree/pegen/Tools/peg_generator) prints a stripped representation (though currently preserving lookaheads -- suppressing those would be a simple change to the code).
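
To give a rough idea (paraphrased, not copied verbatim from python.gram), a rule currently carries a C return type, named sub-results, and a C action:

  return_stmt[stmt_ty]:
      | 'return' a=[star_expressions] NEWLINE { _Py_Return(a, EXTRA) }

The stripped form would just be

  return_stmt: 'return' [star_expressions] NEWLINE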

2. It's quite long, and not at first glance more readable than the LL(1) grammar. I had understood the ugliness in the LL(1) grammar to result from skirting limitations that PEG eliminates. The PEG one is twice as long, but recognising that about half of it is actions, let's just say that as a grammar it is no shorter.

Indeed. I believe part of this actually comes from the desire to be 100% compatible with the old parser (an important constraint is that we don't want to change the AST since we don't want to change the byte code generator).

Another part of it comes from expressing in the grammar constraints that the old parser generator cannot express. For example, the old parser accepts `1 = x` as an assignment, and it is rejected in a later stage. The new parser expresses this restriction in the grammar. Note that the full grammar published in the reference manual (https://docs.python.org/3.8/reference/grammar.html) doesn't say anything about this; the grammar used later to describe assignment_stmt does (https://docs.python.org/3.8/reference/simple_stmts.html#grammar-token-assignment-stmt), but as a result it is not LL(1) -- those grammar sections sprinkled throughout the reference manual are all written and updated by hand (and sometimes we forget!).
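
Roughly (rule names simplified here, not the exact ones from python.gram), the new grammar only admits valid targets on the left-hand side, so `1 = x` never parses in the first place:

  assignment: targets '=' (yield_expr | star_expressions)
  targets: target (',' target)* [',']
  target: NAME | attributeref | subscription | '(' targets ')' | ...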

3. There is some manual guidance by means of &-guards, only necessary (I think) as a speed-up or to force out meaningful syntax errors. That would be noise to the reader. (This would go away if the PEG parser generator generated guards from the FIRST sets at a simple "no backtracking" marker.)

Yeah, see above. We've thought of generating FIRST sets as a future enhancement of the generator, and then they can go away. At the moment the lookaheads we have are all carefully aimed at optimizing the time and space requirements of the parser.
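
For instance, the statement dispatch uses guards roughly like this (simplified); without them a failing alternative would be parsed in full before the parser backtracks and tries the next one:

  compound_stmt:
      | &('def' | '@' | ASYNC) function_def
      | &'if' if_stmt
      | &('with' | ASYNC) with_stmt
      | &('for' | ASYNC) for_stmt
      | &'try' try_stmt
      | &'while' while_stmt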

4. In some places, expansive alternatives seem to be motivated only by a difference between actions -- for a start, wherever async pops up. Maybe that is also why the definition of lambda is so long. This could go away with different support code (e.g. an is_async argument), but if improvements to the support code change grammar rules when the language has not changed, that's a danger sign too.

Yeah, lambda is complicated by the requirement on the generated AST. Arguably we have gone too far here (and for 'parameters', which solves almost the same problem for regular function definitions) and we should put some of the checks back in the support code. But I note that the old grammar also has some warts in the area of parameter definitions (though its lambda is definitely simpler).
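
For comparison, the old Grammar/Grammar gets away with a single short rule here, pushing the parameter details into varargslist and into post-parse checks:

  lambdef: 'lambda' [varargslist] ':' test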

All that, I think, means that the "operational" grammar from which you build the parser is going to be quite unlike the one with which you communicate the language. At present Grammar/Grammar both generates the parser (I thought) and appears as documentation. I take the ideal to be a single, human-readable definition serving both purposes. For example, ANTLR 4 has worked hard to facilitate a grammar in which actions are implicit, and the generation of an AST from the parse tree/events can live elsewhere. (I'm not plugging ANTLR specifically as a solution.)

Our cheaper solution is to remove the actions from the display grammar. But I don't think that Grammar/Grammar should be seen as a complete specification of the language. And I don't think it is terrible if the specification says

  function_def_raw:
      | ASYNC 'def' NAME '(' parameters? ')' ['->' annotation] ':' block
      | 'def' NAME '(' parameters? ')' ['->' annotation] ':' block

instead of

  function_def_raw: [ASYNC] 'def' NAME '(' parameters? ')' ['->' annotation] ':' block

--
--Guido van Rossum (python.org/~guido)