PEP 617: New PEG parser for CPython

Since last fall's core sprint in London, Pablo Galindo Salgado, Lysandros Nikolaou and myself have been working on a new parser for CPython. We are now far enough along that we present a PEP we've written: https://www.python.org/dev/peps/pep-0617/ Hopefully the PEP speaks for itself. We are hoping for a speedy resolution so we can land the code we've written before 3.9 beta 1. If people insist I can post a copy of the entire PEP here on the list, but since a lot of it is just background information on the old LL(1) and the new PEG parsing algorithms, I figure I'd spare everyone the need of reading through that. Below is a copy of the most relevant section from the PEP. I'd also like to point out the section on performance (which you can find through the above link) -- basically performance is on a par with that of the old parser. ============== Migration plan ============== This section describes the migration plan when porting to the new PEG-based parser if this PEP is accepted. The migration will be executed in a series of steps that allow initially to fallback to the previous parser if needed: 1. Before Python 3.9 beta 1, include the new PEG-based parser machinery in CPython with a command-line flag and environment variable that allows switching between the new and the old parsers together with explicit APIs that allow invoking the new and the old parsers independently. At this step, all Python APIs like ``ast.parse`` and ``compile`` will use the parser set by the flags or the environment variable and the default parser will be the current parser. 2. After Python 3.9 Beta 1 the default parser will be the new parser. 3. Between Python 3.9 and Python 3.10, the old parser and related code (like the "parser" module) will be kept until a new Python release happens (Python 3.10). In the meanwhile and until the old parser is removed, **no new Python Grammar addition will be added that requires the peg parser**. This means that the grammar will be kept LL(1) until the old parser is removed. 4. In Python 3.10, remove the old parser, the command-line flag, the environment variable and the "parser" module and related code. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Thu, 2 Apr 2020 at 19:20, Guido van Rossum <guido@python.org> wrote:
Excellent news! One question - will there be any user-visible change as a result of this PEP other than the removal of the "parser" module? From my quick reading of the PEP, I didn't see anything, so I assume the answer is "no". Paul

On Thu, Apr 2, 2020 at 12:43 PM Paul Moore <p.f.moore@gmail.com> wrote:
I suppose it depends on how deep you dig, but the intention is that the returned AST is identical in each case. (We've "cheated" a bit by making a few small changes to the code that produces an AST for the old parser, mostly bugs related to line/column numbers.) -- --Guido van Rossum (python.org/~guido)

Great to see this new work pay off! On Apr 2, 2020, at 11:10, Guido van Rossum <guido@python.org> wrote:
2. After Python 3.9 Beta 1 the default parser will be the new parser.
Just to clarify, this means that 3.9 will ship with the PEG parser as default, right? If so, this would be a new feature, post beta. Since that is counter to our general policy, we would need to get explicit RM approval for such a change. Cheers, -Barry

On Thu, Apr 2, 2020 at 1:21 PM Barry Warsaw <barry@python.org> wrote:
That was the intention, i.e. releasing beta 1 with the new parser being the default. The current wording in the PEP is wrong; we'll fix that. -- --Guido van Rossum (python.org/~guido)

Hi, It's great to see that you finally managed to come up with a PEP and that this work is becoming concrete: congrats! I started to read the PEP, and it's really well written! I heard that LL(1) parsers have limits, but this PEP explains very well that the current Python grammar was already "hacked" to work around these limitations. I also like the fact that PEG is deterministic, whereas LL(1) parsers are not. I like having the new parser be the default: it will ease its adoption and force users to adapt their code. Otherwise, the migration may take forever and never complete :-(

About the migration, can I ask who is going to (help to) fix projects which rely on the AST? I know that the motto was always "we don't provide any backward compatibility warranty on the AST", *but* more and more projects are using the Python AST. Examples of projects relying on the AST:

* gast: used by Pythran
* pylint: uses astroid
* Chameleon
* Genshi
* Mako
* pyflakes
* (likely others)

I'm not asking to stop making AST changes. I'm following AST changes, and the AST is becoming better and better with each Python release! I'm just asking if there are volunteers around to help to make these projects compatible with Python 3.9, before the Python 3.9.0 final release (to accelerate the adoption of Python 3.9). These volunteers don't have to be the ones behind PEP 617.

Note: an example of previous AST incompatible changes (use ast.Constant, remove old AST classes) in Python 3.8: https://bugs.python.org/issue32892 A compatibility layer was added to ease the migration from the old AST classes to the new ast.Constant.

Victor

On Thu, Apr 2, 2020 at 20:15, Guido van Rossum <guido@python.org> wrote:
-- Night gathers, and now my watch begins. It shall not end until my death.

About the migration, can I ask who is going to (help to) fix projects which rely on the AST?
I think you misunderstood: the AST is exactly the same with the old and the new parser. The only thing the new parser does differently is that it does not generate an intermediate CST (Concrete Syntax Tree), and that is only half-exposed in the parser module.

On Thu, Apr 2, 2020 at 2:48 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
If the AST is supposed to be the same, then would it make sense to temporarily – maybe just during the alpha/beta period – always run *both* parsers and confirm that they match? -n -- Nathaniel J. Smith -- https://vorpus.org
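(A rough sketch of what such a cross-check could look like from the outside, driving a dev build through the PEP's proposed -X oldparser flag; the input file name is hypothetical:)

    import subprocess, sys

    SNIPPET = ("import ast, sys; "
               "print(ast.dump(ast.parse(open(sys.argv[1]).read())))")

    def dump(filename, old_parser=False):
        # Run a child interpreter and capture the AST dump it prints.
        cmd = [sys.executable]
        if old_parser:
            cmd += ["-X", "oldparser"]  # flag spelling per PEP 617
        cmd += ["-c", SNIPPET, filename]
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    f = "some_module.py"  # hypothetical input file
    assert dump(f) == dump(f, old_parser=True), "parsers disagree!"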

On Thu, Apr 2, 2020 at 4:20 PM Nathaniel Smith <njs@pobox.com> wrote:
That's not a bad idea! https://github.com/we-like-parsers/cpython/issues/33 -- --Guido van Rossum (python.org/~guido)

On Thu, Apr 02, 2020 at 05:17:31PM -0700, Guido van Rossum wrote:
Even just running it in a dev build against the corpus of the top few thousand packages on pypi might give enough confidence -- I had a script to download the top N packages and run some script over the python files contained therein, but I can't seem to find it atm. m -- Matt Billenstein matt@vazor.com http://www.vazor.com/
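(A minimal sketch of that kind of corpus smoke test; the package list is a hypothetical stand-in, and the real scripts in the pegen repo are pointed out below:)

    import ast, glob, subprocess, zipfile

    PACKAGES = ["requests", "six", "idna"]  # stand-in for "top N on PyPI"
    subprocess.run(["pip", "download", "--no-deps", "-d", "corpus", *PACKAGES],
                   check=True)

    failures = 0
    for wheel in glob.glob("corpus/*.whl"):  # wheels are just zip files
        with zipfile.ZipFile(wheel) as zf:
            for name in zf.namelist():
                if name.endswith(".py"):
                    try:
                        ast.parse(zf.read(name), filename=name)
                    except SyntaxError as err:
                        failures += 1
                        print(wheel, name, err)
    print("failures:", failures)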

On Thu, Apr 2, 2020 at 7:55 PM Matt Billenstein <matt@vazor.com> wrote:
We got that. Check https://github.com/gvanrossum/pegen/tree/master/scripts -- look at download_pypi_packages.py and test_pypi_packages.py. -- --Guido van Rossum (python.org/~guido)

On Thu, Apr 02, 2020 at 08:57:30PM -0700, Guido van Rossum wrote:
Very nice! m -- Matt Billenstein matt@vazor.com http://www.vazor.com/

About the migration, can I ask who is going to (help to) fix projects which rely on the AST?
Whoops, I sent the previous email before finishing it by mistake. Here is the extended version of the answer: I think there is a misunderstanding here: the new parser generates the same AST as the old parser, so calling ast.parse() or compile() will yield exactly the same result. We have extensive testing around that, and that was a goal from the beginning. Projects using the ast module will not need to do anything special. The difference is that the new parser does not generate a CST (Concrete Syntax Tree). The concrete syntax tree is an intermediate structure from which the AST is generated. This structure is only partially exposed via the "parser" module but is otherwise only used in the parser itself, so it should not be a problem. On the other hand, as explained in the PEP, the lack of a CST greatly simplifies AST generation, among other advantages.
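(To make the distinction concrete, here is the half-exposed CST next to the stable AST -- a quick illustration; the "parser" module goes away with the old parser per the migration plan:)

    import ast
    import parser  # wraps the old parser's CST; slated for removal in 3.10

    st = parser.suite("x = 1")
    print(parser.st2tuple(st))           # nested tuples of symbol/token numbers
    print(ast.dump(ast.parse("x = 1")))  # the AST: identical under either parser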

On Thu, Apr 2, 2020 at 2:55 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
I think that's only half true. It's true if they already work with Python 3.9 (master/HEAD). But probably some of these packages have not yet started testing with 3.9 nightly runs or even alphas, so it's at least *conceivable* that some of the fixes we applied to the AST could require (small) adjustments. And I think *that* was what Victor was referring to. (For example, I'm not 100% sure that mypy actually works with the latest 3.9. But there seems to be something else wrong there so I can't even test it.)
I just remembered another difference. We haven't really investigated how good the error reporting is. I'm sure there are cases where the syntax error points at a *slightly* different position -- sometimes it's a bit better, sometimes a bit worse. But there could be cases where the PEG parser reads ahead chasing some alternative that will fail much later, and then it would be much worse. We should probably explore this. -- --Guido van Rossum (python.org/~guido)

Sorry, I was referring to *ambiguous* grammar rules. Extract of the PEP: "Unlike LL(1) parsers PEG-based parsers cannot be ambiguous: if a string parses, it has exactly one valid parse tree. This means that a PEG-based parser cannot suffer from the ambiguity problems described in the previous section." Victor On Fri, Apr 3, 2020 at 02:58, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
-- Night gathers, and now my watch begins. It shall not end until my death.

On Fri, Apr 3, 2020 at 02:58, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
On Thu, Apr 2, 2020 at 6:15 PM Victor Stinner <vstinner@python.org> wrote:
Maybe we need to rephrase this a bit. It's more that the LL(1) and PEG formalisms deal very differently with ambiguous *grammars*. An example of an ambiguous grammar would be:

    start: X | Y
    X: expr
    Y: expr
    expr: NAME | NAME '+' NAME

There are probably better examples of ambiguous grammars (see https://en.wikipedia.org/wiki/Ambiguous_grammar) but I think this will do to explain the problem. This is a fine context-free grammar (it accepts strings like "a" and "a+b") but the LL(1) formalism will reject it because it sees an overlap in FIRST sets between X and Y -- not surprising because they have the same RHS. Also, even a more powerful formalism would have to make a choice whether to choose X or Y, which may matter if the derivation is used to build a parse tree (like Python's pgen does).

OTOH a PEG parser generator will always take the X alternative -- it doesn't care that there's more than one derivation, since its '|' operator is not symmetrical: X|Y and Y|X are not the same, as they are in LL(1) and most other formalisms. (In fact, the common notation for PEG uses '/' to emphasize this, but it looks ugly to me so I changed it to '|'.)

That PEG (by definition) always uses the first matching alternative is actually a blessing as well as a curse. The downside is that PEG can't tell you when you have a real ambiguity in your grammar. But the upside is that it works like a programmer would write a (recursive descent) parser. Thus it "solves" the problem of ambiguous grammars by choosing the first alternative. This allows more freedom in designing a grammar. For example, it would let a language designer solve the "dangling else" problem from the Wikipedia page by writing the form including the "else" clause first. (Python doesn't have that problem due to the use of indentation, but it might appear in another disguise.)

I should probably refine this argument and include it in the PEP as one of the reasons to prefer PEG over LR or LALR (but I need to think more about that -- it was a very early choice). -- --Guido van Rossum (python.org/~guido)
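(The committed, ordered choice is easy to see in a toy hand-written parser for exactly that grammar -- a sketch, not pegen; NAMEs are single letters and tokens are characters:)

    def expr(s, i):
        # expr: NAME '+' NAME | NAME -- the longer alternative goes first;
        # PEG's ordered choice commits, so with the other order "a+b" would
        # match only the bare NAME and the parse of the full input would fail.
        if i < len(s) and s[i].isalpha():
            if i + 2 < len(s) and s[i + 1] == '+' and s[i + 2].isalpha():
                return ('expr', s[i], '+', s[i + 2]), i + 3
            return ('expr', s[i]), i + 1
        return None

    def start(s):
        # start: X | Y, with X: expr and Y: expr. X is tried first and
        # succeeds whenever Y would, so Y is simply unreachable: one input,
        # exactly one parse tree, the "ambiguity" resolved by fiat.
        for rule in ('X', 'Y'):
            result = expr(s, 0)
            if result and result[1] == len(s):
                return (rule, result[0])
        return None

    print(start("a"))    # ('X', ('expr', 'a'))
    print(start("a+b"))  # ('X', ('expr', 'a', '+', 'b')) -- never 'Y'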

On 3/04/20 3:22 pm, Guido van Rossum wrote:
I'm inclined to think that such problems shouldn't be solved at the parser level, but rather at the language level, i.e. don't design the language that way in the first place. After all, if it's confusing to the computer, it's probably going to be confusing to humans as well. (I note that all of Wirth's languages after Pascal changed the syntax so as not to have a dangling else problem.) Personally I would rather my parser generator *did* complain about ambiguities, so that I can facepalm myself for designing my language in such a stupid way. -- Greg

On 3/04/20 2:13 pm, Victor Stinner wrote:
That paragraph seems rather confused. I think what it *might* be trying to say is that a PEG parser allows you to write productions with overlapping first sets (which would be "ambiguous" for an LL parser), but still somehow guarantees that a unique parse tree is produced. The latter suggests that the grammar as a whole still needs to be unambiguous. -- Greg

We may need to rephrase this to make it a bit more clear, but this is trying to say that PEG grammars cannot be ambiguous in the same sense as context-free grammars are normally said to be ambiguous. Notice that an ambiguous grammar is normally defined (for instance here: https://en.wikipedia.org/wiki/Ambiguous_grammar) only for context-free grammars, as a grammar with more than one possible parse tree. In the PEG formalism, as Guido explained in the previous email, there is only one possible parse tree because the parser always chooses the first option. As a consequence of this (and as a particular case of it), and as you mention, the PEG formalism allows writing productions with overlapping first sets. Also, notice that first sets are mainly relevant for LL(k) parsers and the like because those need to *deduce* which alternative to follow given multiple choices in a production, while PEG will always try them in order. In general, the argument is that because of how PEG works, there will only be one parse tree, and this makes the grammar "not ambiguous" under the typical definition of ambiguity for context-free grammars (having multiple parse trees).

On 3/04/20 7:10 am, Guido van Rossum wrote:
Was any consideration given to other types of parser, such as LR or LALR? LR parsers handle left recursion naturally, and don't suffer from any of the drawbacks mentioned in the PEP such as taking exponential time or requiring all the source to be loaded into memory. I think there needs to be a section in the PEP justifying the choice of PEG over the alternatives. -- Greg

On 4/04/20 9:29 am, Brett Cannon wrote:
I think "needs" is a bit strong. It would be nice, though. Regardless, as long as this is a net improvement over the status quo I don't see this being rejected on the grounds that an LR or LALR parser would be better since we have a working PEG parser today. :)
Even if the section only says "We didn't consider any alternatives, because...", I still think it should be there. -- Greg

Thanks, Guido, Pablo, Lysandros, that's a great PEP. Also thanks to everyone else working on the PEG parser over the last year, like Emily. I know it's a lot of work, but as someone who's intimately aware of the headaches caused by the LL(1) parser, I greatly appreciate it :). The only thing I'm missing from the PEP is more detail about how the cross-language nature of the parser actions is handled. The example covers just C, and the description of the actions says they're C expressions. The only mention of Python code generation is for alternatives without actions. Is the intent that the actions are cross-language, or translated to Python somehow, or is the support for generating a Python-based parser merely for debugging, as that action suggests? -- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!

Oh, good point. Thanks for pointing that out. We certainly need to explain that a bit better. The current situation is that actions support both Python and C code. They are basically pieces of code that will be included in the resulting program, no matter what language it is written in. For instance, we use the Python generator to generate the code that parses the grammar for the generator itself. The output is written in Python and the metagrammar uses actions written in Python: https://github.com/we-like-parsers/cpython/blob/pegen/Tools/peg_generator/pe... So regarding the usage of Python code generation: it is certainly useful for debugging, but it is also actually used by the generator itself to bootstrap a section of it (the one that parses grammars). The feeling of bootstrapping parsers never gets old, and it is one of the most fun parts :) I will prepare a PR soon to complement the section about actions in the PEP.

The only thing I'm missing from the PEP is more detail about how the cross-language nature of the parser actions are handled.
Expanded the "actions" section in the PEP here: https://github.com/python/peps/pull/1357

The tl;dr is that actions specified in the grammar are specific to the target language. So if you want to use the pegen tool to generate both Python and C code for the same grammar, you would need two grammar files with the same grammar but different actions. Since our goal here is just to generate a parser for use in CPython that's not a problem. Other PEG parser generators make different choices, e.g. TatSu puts semantics actions in a separate file (https://tatsu.readthedocs.io/en/stable/semantics.html). On Sun, Apr 5, 2020 at 11:06 AM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 6/04/20 2:08 am, Jelle Zijlstra wrote:
And related to that, how precisely will it be able to pinpoint the location of the error? The backtracking worries me a bit in that regard. I can imagine it trying all possible ways to parse the input and then only being able to say "Something is wrong somewhere in this file." -- Greg

Unfortunately they look pretty much the same. We're currently trying to improve the error messages for situations where the old parser produces something specialized (mostly because the LL(1) grammar can't express something and the check is done in a later pass).
There's no need to worry about this: in almost all cases the error indicator points to the same spot in the source code as with the old parser. I was worried about this too, but it really doesn't seem to be a problem -- I think this might be different with highly ambiguous grammars, but since Python's grammar is still *mostly* LL(1), it looks like we're fine. -- --Guido van Rossum (python.org/~guido)

On 6/04/20 4:48 am, Guido van Rossum wrote:
I'm curious about how that works. From the description in the PEP, it seems that none of the individual parsing functions can report an error, because there might be another branch higher up that succeeds. Does it keep track of the maximum distance it got through the source or something like that? -- Greg

On Sun, Apr 5, 2020 at 5:16 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I guess you could call it that. There is a small layer of abstraction between the actual tokenizer (which cannot go back) and the generated parser functions. This abstraction buffers tokens. When a parser function wants a token it calls into this abstraction, and that either satisfies it from its buffer, or if there is no lookahead in the buffer left, calls the actual tokenizer. When a parser function fails, it calls into the abstraction layer to back up to a previous point (which I call the "mark"). (A simplified version of this layer is shown in my blog post, https://medium.com/@gvanrossum_83706/building-a-peg-parser-d4869b5958fb -- the class Tokenizer.) When an error bubbles all the way up, we report a SyntaxError pointing to the farthest token that the abstraction has buffered (self.pos in the blog post). -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

The PEP gives a good exposition of the problem and proposed solution, thanks. If I understand correctly, the proposal is that the PEG grammar should become the definitive grammar for Python at some point, probably for Python 3.10, so it may evolve without the LL(1) restrictions. I'd like to raise some points with respect to that, which perhaps the migration section could answer. When definitive, the grammar would not then just be for CPython, and would also appear as user documentation of the language. Whether that change leaves Python with a more useful (readable) grammar seems an important test of the idea. I'm looking at https://github.com/we-like-parsers/cpython/blob/pegen/Grammar/python.gram , and assuming that is indicative of a future definitive grammar. That may be incorrect, as it has these issues in my view:

1. It is decorated with actions in C. If a decorated grammar is offered as definitive, one with Python actions (operations on the AST) is preferable, as implementation-neutral, although still hostage to AST changes that are not language changes. Maybe one stripped of actions is best.

2. It's quite long, and not at first glance more readable than the LL(1) grammar. I had understood ugliness in the LL(1) grammar to result from skirting limitations that PEG eliminates. The PEG one is twice as long, but recognising that about half of it is actions, let's just say that as a grammar it's no shorter.

3. There is some manual guidance by means of &-guards, only necessary (I think) as a speed-up or to force out meaningful syntax errors. That would be noise to the reader. (This goes away if the PEG parser generator generates guards from the first set at a simple "no backtracking" marker.)

4. In some places, expansive alternatives seem to be motivated by the difference between actions; for a start, wherever async pops up. Maybe it is also why the definition of lambda is so long. That could go away with different support code (e.g. is_async as an argument), but if improvements to the support change grammar rules when the language has not changed, that's a danger sign too.

All that I think means that the "operational" grammar from which you build the parser is going to be quite unlike the one with which you communicate the language. At present ~/Grammar/Grammar both generates the parser (I thought) and appears as documentation. I take it to be the ideal that we use a single, human-readable definition. For example, ANTLR 4 has worked hard to facilitate a grammar in which actions are implicit, and the generation of an AST from the parse tree/events can be elsewhere. (I'm not plugging ANTLR specifically as a solution.)

Jeff Allen

On 02/04/2020 19:10, Guido van Rossum wrote:

On Mon, Apr 6, 2020 at 5:18 AM Jeff Allen <ja.py@farowl.co.uk> wrote:
Thanks, you definitely have a point here.
Yes, the plan is to strip actions and a few other embellishments (types, names, cuts, and probably also lookaheads -- although the latter may be significant, we only use them for optimization). The parser generator (https://github.com/we-like-parsers/cpython/tree/pegen/Tools/peg_generator) prints a stripped representation (though currently preserving lookaheads -- suppressing those would be a simple change to the code).
Indeed. I believe part of this actually comes from the desire to be 100% compatible with the old parser (an important constraint is that we don't want to change the AST, since we don't want to change the byte code generator). Another part of it comes from expressing in the grammar constraints that the old parser generator cannot express. For example, the old parser accepts `1 = x` as an assignment, and it is rejected in a later stage. The new parser expresses this restriction in the grammar. Note that the full grammar published in the reference manual (https://docs.python.org/3.8/reference/grammar.html) doesn't say anything about this; the grammar used later to describe assignment_stmt does (https://docs.python.org/3.8/reference/simple_stmts.html#grammar-token-assign...), but as a result it is not LL(1) -- those grammar sections sprinkled throughout the reference manual are all written and updated by hand (and sometimes we forget!).
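(As a quick illustration of the `1 = x` case: where the rejection happens differs between the two parsers, but the observable result is the same -- the exact message text varies by version:)

    import ast

    try:
        ast.parse("1 = x")
    except SyntaxError as err:
        # Old parser: accepted by the LL(1) grammar, rejected in a later pass.
        # New parser: rejected by the grammar itself.
        print(err.msg)  # e.g. "cannot assign to literal"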
Yeah, see above. We've thought of generating FIRST sets as a future enhancement of the generator, and then they can go away. At the moment the lookaheads we have are all carefully aimed at optimizing the time and space requirements of the parser.
Yeah, lambda is complicated by the requirement on the generated AST. Arguably we have gone too far here (and for 'parameters', which solves almost the same problem for regular function definitions) and we should put some of the checks back in the support code. But I note that the old grammar also has some warts in the area of parameter definitions (though its lambda is definitely simpler).
Our cheaper solution is to remove the actions from the display grammar. But I don't think that Grammar/Grammar should be seen as a complete specification of the language. And I don't think it is terrible if the specification says

    function_def_raw:
        | ASYNC 'def' NAME '(' parameters? ')' ['->' annotation] ':' block
        | 'def' NAME '(' parameters? ')' ['->' annotation] ':' block

instead of

    function_def_raw: [ASYNC] 'def' NAME '(' parameters? ')' ['->' annotation] ':' block

-- --Guido van Rossum (python.org/~guido)

On Thu, Apr 2, 2020 at 3:16 PM Guido van Rossum <guido@python.org> wrote:
Hi Guido, I think using a PEG parser is interesting, but I do have some questions related to what to expect in the future for other people who have to follow the Python grammar, so can you shed some light on this? Does that mean that the grammar format currently available (which is currently specified in https://docs.python.org/3.8/reference/grammar.html) will no longer be updated/used? Is it expected that other language implementations/parsers also have to move to a PEG parser in the future? -- which would probably be the case if the language deviates strongly from LL(1). Thanks, Fabio

On Mon, Apr 6, 2020 at 4:03 AM Fabio Zadrozny <fabiofz@gmail.com> wrote:
The grammar format used for the PEG parser is nearly the same as the old grammar, when you remove actions and some embellishments needed for actions. The biggest difference is that the `|` operator is no longer symmetrical (since if you have alternatives `A | B`, and both match at some point in the input, PEG reports A, while the old generator would reject the grammar as being ambiguous).
We don't specify how other implementations must parse the language -- in fact I have no idea how the parsers of any of the other implementations work. I'm sure there will be other ways to parse the same language. But yeah, if there are implementations that currently closely follow Python's LL(1) parser structure they may have to be changed once we start introducing new syntax that makes use of the freedom PEG gives us. (For example, I've been toying with the idea of introducing a "match" statement similar to Scala's match expression by making "match" a keyword only when followed by an expression and a colon.) -- --Guido van Rossum (python.org/~guido)
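(A toy illustration of what such a context-sensitive "soft" keyword means in practice -- a hypothetical sketch, not pegen or CPython code; a real parser would try the full match-statement rule and backtrack on failure:)

    import re

    TOKEN = re.compile(r"\w+|[^\s\w]")

    def classify(line):
        tokens = TOKEN.findall(line)
        # "match" acts as a keyword only when followed by an expression
        # and a colon; everywhere else it is an ordinary name.
        if len(tokens) >= 3 and tokens[0] == "match" and tokens[-1] == ":":
            return "match statement"
        return "ordinary statement"

    print(classify("match point:"))  # match statement
    print(classify("match = True"))  # ordinary statement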

On Mon, Apr 06, 2020 at 10:43:11AM -0700, Guido van Rossum wrote:
Didn't we conclude from `as` that having context-sensitive keywords was a bad idea? Personally, I would not like to have to explain to newcomers why `match` is a keyword but you can still use it as a function or variable, but not other keywords like `raise`, `in`, `def` etc.

    match expression:
        match = True

-- Steven

On Mon, Apr 6, 2020 at 11:36 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not sure that that was the conclusion. At the time the point was that we *wanted* all keywords to be reserved everywhere, and `as` was an ugly exception to that rule, which we got rid of as soon as we could -- not because it was a bad idea but because it violated a somewhat arbitrary rule. We went through the same thing with `async` and `await`, and the experience there was worse: a lot of libraries in the very space `async def` was aimed at were using `async` as a parameter name, often in APIs, and they had to scramble to redesign their APIs and get their users to change their programs. In retrospect I wish we had just kept `async` as a context-sensitive keyword, since it was totally doable. (In an early version of the PEG parser, all keywords were context-sensitive, and there were only very few places in the grammar where this required us to insert negative lookaheads to make edge cases parse correctly. The rest was taken care of by careful ordering of rules, e.g. the rule for `del_stmt` must be tried before the rule for `expression_stmt`, since `del *x` would match the latter.)
What kind of newcomers do you have that they even notice that, unless you were to draw attention to it? I'm serious -- from the kind of questions I've seen in user forums, most newcomers are having a hard enough time learning more fundamental concepts and abstractions than the precise rules for reserved words. -- --Guido van Rossum (python.org/~guido)

On Tue, Apr 7, 2020 at 5:03 AM Guido van Rossum <guido@python.org> wrote:
From my experience of teaching a variety of languages, including SQL, it's usually not something people have a problem with in toy examples - but it becomes a major nuisance when they're trying to deal with a problem and some keyword is getting in the way. SQL is *full* of context-sensitive keywords, and every once in a while, someone uses a non-reserved word as a column name, and everything works until they run into some specific context where it doesn't work. (It's a bit messier than in Python due to multiple abstraction layers, e.g. ORMs, and sometimes they deal with these issues and sometimes not; but it's still that much harder to debug specifically _because_ things aren't always reserved.) Ultimately it comes down to the number of edge cases that people have to learn, and how edgy those cases are. Python already has the possibility to override builtins, so you can say "list = []" without an error; context-sensitive keywords sit in a space between those and fully-reserved words. It'll come down to specific words as to whether it's inevitably going to be a problem down the track, or almost certainly going to be fine. BTW, is the PEG parser going to make it easier to hack on the language syntax? If so, it'd be that much easier to experiment with these kinds of ideas in a separate branch/fork, and quickly find out if there's going to be any major impact. At the moment, editing the grammar is a bit daunting - too many easy ways to mess it up. ChrisA

On Mon, Apr 6, 2020 at 8:04 PM Guido van Rossum <guido@python.org> wrote:
Absolutely. Beginners can simply be told they are keywords. If they then come across them in other contexts, hopefully a web search for "<keyword> keyword" will lead to a sensible documentation page with an explanation like: "Some Python keywords can only ever be used with that meaning. Others can be used with other meanings where the context makes it clear that the keyword interpretation does not apply. You are recommended not to use such keywords as names in your own programs. The feature was implemented to make porting existing code to future versions of Python simpler." The tutorial should contain a similar passage.

On Mon, Apr 06, 2020 at 11:54:54AM -0700, Guido van Rossum wrote:
I think, on first glance, I'd rather have all keywords context-sensitive than just some. But I haven't put a great deal of thought into that aspect of it, and I reserve the right to change my mind :-)
It didn't take me 25 years to try using "of" and "if" for "output file" and "input file", so I guess my answer to your question is ordinary newcomers :-) "Newcomers" doesn't just include beginners to programming; it can include people experienced in one or more other languages coming to Python for the first time. But if we're talking about complete beginners, the concept of what is and isn't a keyword is not always clear. Why is the first of these legal but not the second? Both words are highlighted in my editor:

    str = "Hello world"
    class = "wizard"

People are going to learn that `match` is a keyword, and then they are going to come across code using it as a variable or method, and while the context-sensitive rule might be obvious to us, it won't be obvious to them, precisely because they are still learning the language rules. I think that `match` would be an especially interesting case, because I can easily see someone starting off with a variable `match` that they handle in an `if` statement, and then as the code evolves they shift it to a `match` statement:

    match match:

and not bother to refactor the name because they are familiar enough with it that the meaning is obvious. On the other hand there are definitely a few keywords that collide with useful names. Apart from `if`, I have wanted to use these as variables, parameters or functions: class, raise, in, from, except, while, lambda (off the top of my head, there may be others). There's at least one place in the random module where a parameter is misspelled "lambd" because lambda is a keyword. So there is certainly something to be said for getting rid of keywords. On the third hand, keywords don't just make it easier for the interpreter, they also make it easier for the human reader. You don't need to care about context: `except` is `except` wherever you see it. That makes it a dead-simple rule for anyone to learn, because there are no exceptions, pun intended. (I guess inside strings and comments are exceptions, but they are well-understood and *simple* exceptions.) I just can't help feeling at this point that while there are pros and cons to making things a keyword, having some keywords be context-sensitive but not others is going to combine the worst of both and end up being confusing and awkward.
That's because the precise rules for reserved words are dead-simple to learn. You can't use them anywhere except in the correct context. If we start adding exceptions to that, so that reserved words are only sometimes reserved, I think that will make them harder to learn. If it's only some reserved words but not others, that's even harder, because we then have three classes of words:

* words that are never reserved
* words that are sometimes reserved, depending on what is around them
* words that are always reserved

I had thought that "no context-sensitive keywords" was a hard rule, so I was surprised that you are now re-considering it. -- Steven

After 30 years am I not allowed to take new information into account and consider a change of heart? :-) On Mon, Apr 6, 2020 at 6:21 PM Steven D'Aprano <steve@pearwood.info> wrote:
-- --Guido van Rossum (python.org/~guido)

Another point in favour of always-reserved keywords is that they make life a lot easier for syntax highlighters. -- Greg

On 7/04/20 6:54 am, Guido van Rossum wrote:
I don't see it as an arbitrary rule, or at least no more arbitrary than any other language rule. Given that the rule exists, it's the exception that seems arbitrary. There's little justification for it other than "we only thought of using it as a keyword later". To reduce arbitrariness, we would either have to make *all* keywords context-sensitive, or come up with some principled way of deciding whether a given keyword should be reserved or not. -- Greg

On 7/04/20 5:43 am, Guido van Rossum wrote:
I'm still inclined to think that allowing ambiguous grammars is more of a bug than a feature. Is there some way the generator could be made to at least warn if the grammar is genuinely ambiguous (as opposed to just having overlapping first sets in alternatives)?
We don't specify how other implementations must parse the language
And this is one of the reasons. If we use a PEG grammar as the definition of the language, and aren't careful about ambiguities when we add new syntax, we might accidentally end up with something that can *only* be parsed with a PEG parser or something equally powerful.
I'm sure there will be other ways to parse the same language.
That's certainly true now, but can you be sure it will remain true if additions are made that rely on the full power of PEG? -- Greg

After the feedback received at the language summit, we have made a modification to the proposed migration plan in PEP 617, so the new parser will be the default in 3.9 alpha 6: https://github.com/python/peps/pull/1369

The PEP is exciting and is very clearly presented, thank you all for the hard work! Considering the comments in the PEP about the new parser not preserving a parse tree or CST, I have some questions about the future options for Python language-services tooling which requires a CST in order to round-trip and modify Python code. Examples in this space include auto-formatters, refactoring tools, linters with autofix, etc. Today many such tools (e.g. Black, 2to3) are based on lib2to3. Other tools already have their own parser (e.g. LibCST -- which I help maintain -- and Jedi both use parso, a fork of pgen2).

1) 2to3 and lib2to3 are not mentioned in the PEP, but are a documented part of the standard library used by some very popular tools, and currently depend on pgen2. A quick search of the PEP 617 pull request does not suggest that it modifies lib2to3. Will lib2to3 also be removed in Python 3.10 along with the old parser? It might be good for the PEP to address the future of 2to3 and lib2to3 explicitly.

2) As these tools make the necessary adaptations to support Python 3.10, which may no longer be parsable with an LL(1) parser, will we be able to leverage any part of pegen to construct a lossless Python CST, or will we likely need to fork pegen outside of CPython or build a wholly new parser? It would be neat if an alternate grammar could be written in pegen that has access to all tokens (including NL and COMMENT) for this purpose; that would save a lot of code duplication and potential for inconsistency. I haven't had a chance to fully read through the PEP 617 pull request, but it looks like its tokenizer wrapper currently discards NL and COMMENT. I understand this is a distinct use case with distinct needs, and I'm not suggesting that we should make significant sacrifices in the performance or maintainability of pegen to serve it, but if it's possible to enable some sharing by making API choices now before it's merged, that seems worth considering.

Carl

On Sat, Apr 18, 2020 at 4:53 PM Carl Meyer <carl@oddbird.net> wrote:
Right, LibCST is very exciting. Note that AFAIK none of the tools you mention depend on the old parser module. (Though I'm not denying that there might be tools depending on it -- that's why we're keeping it until 3.10.)
Note that, while there is indeed a docs page about 2to3 <https://docs.python.org/3/library/2to3.html>, the only docs for *lib2to3* in the standard library reference are a link to the source code and a single "*Note:* The lib2to3 <https://docs.python.org/3/library/2to3.html?highlight=lib2to3#module-lib2to3> API should be considered unstable and may change drastically in the future." Fortunately, in order to support the 2to3 application, lib2to3 doesn't need to change, because the syntax of Python 2 is no longer changing. :-) Choosing to remove 2to3 is an independent decision. And lib2to3 does not depend in any way on the old parser module. (It doesn't even use the standard tokenize module, but incorporates its own version that is slightly tweaked to support Python 2.)
You've mentioned a few different tools that already use different technologies: LibCST depends on parso, which has a fork of pgen2, and lib2to3 has the original pgen2. I wonder if this would be an opportunity to move such parsing support out of the standard library completely. There are already two versions of pegen, but neither is in the standard library: there is the original pegen <https://github.com/gvanrossum/pegen/> repo which is where things started, and there is a fork of that code in the CPython Tools <https://github.com/we-like-parsers/cpython/tree/pegen/Tools/peg_generator> directory (not yet in the upstream repo, but see PR 19503 <https://github.com/python/cpython/pull/19503>). The pegen tool has two generators, one generating C code and one generating Python code.

I think that the C generator is really only relevant for CPython itself: it relies on the builtin tokenizer (the one written in C, not the stdlib tokenize.py) and the generated C code depends on many internal APIs. In fact the C generator in the original pegen repo doesn't work with Python 3.9 because those internal APIs are no longer exported. (It also doesn't work with Python 3.7 or older because it makes critical use of the walrus operator. :-) Also, once we started getting serious about replacing the old parser, we worked exclusively on the C generator in the CPython Tools directory, so the version in the original pegen repo is lagging quite a bit behind (as is the Python grammar in that repo). But as I said you're not gonna need it.

On the other hand, the Python generator is designed to be flexible, and while it defaults to using the stdlib tokenize.py tokenizer, you can easily hook up your own. Putting this version in the stdlib would be a mistake, because the code is pretty immature; it is really waiting for a good home, and if parso or LibCST were to decide to incorporate a fork of it and develop it into a high quality parser generator for Python-like languages, that would be great. I wouldn't worry much about the duplication of code -- the Python generator in the CPython Tools directory is only used for one purpose, and that is to produce the meta-parser (the parser for grammars) from the meta-grammar. And I would happily stop developing the original pegen once a fork is being developed.

Another option would be to just improve the Python generator in the original pegen repo to satisfy the needs of parso and LibCST. Reading the blurb for parso it looks like it really just parses *Python*, which is less ambitious than pegen. But it also seems to support error recovery, which currently isn't part of pegen. (However, we've thought <https://github.com/we-like-parsers/cpython/issues/84> about it.) Anyway, regardless of how exactly this is structured, someone will probably have to take over development and support. Pegen started out as a hobby project to educate myself about PEG parsers. Then I wrote a bunch of blog posts about my approach, and finally I started working on using it to generate a replacement for the old pgen-based parser. But I never found the time to make it an appealing parser generator tool for other languages, even though that was on my mind as a future possibility. It will take some time to disentangle all this, and I'd be happy to help someone who wants to work on this.

Finally, I should recognize the important influence of my mentor in PEG parsing, Juancarlo Añez <https://github.com/apalala/>. Without his early encouragement and advice I would never have been able to travel this road.
-- --Guido van Rossum (python.org/~guido)

On Sat, Apr 18, 2020 at 10:38 PM Guido van Rossum <guido@python.org> wrote:
Note that, while there is indeed a docs page about 2to3, the only docs for lib2to3 in the standard library reference are a link to the source code and a single "Note: The lib2to3 API should be considered unstable and may change drastically in the future."
Fortunately, in order to support the 2to3 application, lib2to3 doesn't need to change, because the syntax of Python 2 is no longer changing. :-) Choosing to remove 2to3 is an independent decision. And lib2to3 does not depend in any way on the old parser module. (It doesn't even use the standard tokenize module, but incorporates its own version that is slightly tweaked to support Python 2.)
Indeed! Thanks for clarifying, I now recall that I already knew it doesn't, but forgot. The docs page for 2to3 does currently say "lib2to3 could also be adapted to custom applications in which Python code needs to be edited automatically." Perhaps at least this sentence should be removed, and maybe also replaced with a clearer note that lib2to3 not only has an unstable API, but also should not necessarily be expected to continue to parse future Python versions, and thus building tools on top of it should be discouraged rather than recommended. (Maybe even use the word "deprecated.") Happy to submit a PR for this if you agree it's warranted. It still seems to me that it wouldn't hurt for PEP 617 itself to also mention this shift in lib2to3's effective status (from "available but no API stability guarantee" to "probably will not parse future Python versions") as one of its indirect effects.
Thanks, this is all very clarifying! I hadn't even found the original gvanrossum/pegen repo, and was just looking at the CPython PR for PEP 617. Clearly I haven't been following this work closely.
Another option would be to just improve the python generator in the original pegen repo to satisfy the needs of parso and LibCST. Reading the blurb for parso it looks like it really just parses *Python*, which is less ambitious than pegen. But it also seems to support error recovery, which currently isn't part of pegen. (However, we've thought about it.) Anyway, regardless of how exactly this is structured someone will probably have to take over development and support. Pegen started out as a hobby project to educate myself about PEG parsers. Then I wrote a bunch of blog posts about my approach, and finally I started working on using it to generate a replacement for the old pgen-based parser. But I never found the time to make it an appealing parser generator tool for other languages, even though that was on my mind as a future possibility. It will take some time to disentangle all this, and I'd be happy to help someone who wants to work on this.
This seems like the place to start. When we start work on Python 3.10 support for LibCST, we can start with trying to use and adapt pegen in place of the vendored fork of parso we currently use, and if that's promising enough, consider taking over maintenance of it. Carl

Great! Please submit a PR to update the [lib]2to3 docs and CC me (@gvanrossum). While perhaps it wouldn't hurt if the PEP mentioned lib2to3, it was just accepted by the Steering Council without such language, and I wouldn't want to imply that the SC agrees with everything I said. So I still think we ought to deal with lib2to3 independently (and no, it won't need its own PEP :-). A reasonable option would be to just deprecate it and recommend people use parso, LibCST or something else (I wouldn't recommend pegen in its current form yet). On Tue, Apr 21, 2020 at 6:21 PM Carl Meyer <carl@oddbird.net> wrote:
-- --Guido van Rossum (python.org/~guido)

Could we go ahead and mark lib2to3 as Pending Deprecation in 3.9 so we can get it out of the stdlib by 3.11 or 3.12? lib2to3 is the basis of all sorts of general source code manipulation tooling. Its name and original raison d'être have moved on. It is actively used to parse and rewrite Python 3 code all the time. yapf uses it, black uses a fork of it. Other Python code manipulation tooling uses it. Modernize-like fixers are useful for all sorts of cleanups. IMNSHO it would be better if lib2to3 were *not* in the stdlib anymore - Black already chose to fork lib2to3 <https://github.com/psf/black/tree/master/blib2to3>. So given that it is eventually not going to be able to parse future syntax, the better answer seems like deprecation, putting the final version up on PyPI and letting any descendants of it live on PyPI, where they can get more active care than a stdlib module ever does. -gps On Tue, Apr 21, 2020 at 6:58 PM Guido van Rossum <guido@python.org> wrote:

On Tue, Apr 21, 2020 at 9:35 PM Gregory P. Smith <greg@krypto.org> wrote:
Could we go ahead and mark lib2to3 as Pending Deprecation in 3.9 so we can get it out of the stdlib by 3.11 or 3.12?
I'm going ahead and tracking the idea in https://bugs.python.org/issue40360.

Hi Guido, Pablo & Lysandros, I'm excited about this improvement to Python, and was interested to hear about it at the language summit as well. I happen to be friends with Alessandro Warth, whom you cited in the PEP as developing the packrat parsing technique you use (at least in part). I wrote to him to ask if he knew he was being cited, and he responded in part with these comments. The additional link may perhaps be useful for you: Alex: (If they had gotten in touch, I would have pointed them at my
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.

On Thu, 2 Apr 2020 at 19:20, Guido van Rossum <guido@python.org> wrote:
Excellent news! One question - will there be any user-visible change as a result of this PEP other than the removal of the "parser" module? From my quick reading of the PEP, I didn't see anything, so I assume the answer is "no". Paul

On Thu, Apr 2, 2020 at 12:43 PM Paul Moore <p.f.moore@gmail.com> wrote:
I suppose it depends on how deep you dig, but the intention is that the returned AST is identical in each case. (We've "cheated" a bit by making a few small changes to the code that produces an AST for the old parser, mostly bugs related to line/column numbers.) -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Great to see this new work pay off! On Apr 2, 2020, at 11:10, Guido van Rossum <guido@python.org> wrote:
2. After Python 3.9 Beta 1 the default parser will be the new parser.
Just to clarify, this means that 3.9 will ship with the PEG parser as default, right? If so, this would be a new feature, post beta. Since that is counter to our general policy, we would need to get explicit RM approval for such a change. Cheers, -Barry

On Thu, Apr 2, 2020 at 1:21 PM Barry Warsaw <barry@python.org> wrote:
That was the intention, i.e. releasing beta 1 with the new parser being the default. The current wording in the PEP are wrong, we'll fix that. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Hi, It's great to see that you finally managed to come up with a PEP, this work becomes concrete: congrats! I started to read the PEP, and it's really well written! I heard that LL(1) parsers have limits, but this PEP explains very well that the current Python grammar was already "hacked" to work around these limitations. I also like the fact that PEG is deterministic, whereas LL(1) parsers are not. I like to have the new parser being the default, it will ease its adoption and force users to adapt their code. Otherwise, the migration may take forever and never complete :-( -- About the migration, can I ask who is going to (help to) fix projects which rely on the AST? I know that the motto was always "we don't provide any backward compatibility warranty on the AST", *but* more and more projects are using the Python AST. Examples of projects relying on the AST: * gast: used by Pythran * pylint uses astroid * Chameleon * Genshi * Mako * pyflakes * (likely others) I'm not asking to stop making AST changes. I'm following AST changes, and the AST is becoming better and better at each Python release! I'm just asking is there are volunteers around to help to make these projects compatible with Python 3.9, before the Python 3.9.0 final release (to accelerate the adoption of Python 3.9). These volunteers don't have to be the ones behind the PEP 617. Note: example of previous AST incompatible changes (use ast.Constant, remove old AST classes) in Python 3.8: https://bugs.python.org/issue32892 A compatibility layer was added to ease the migration from old AST classes to the new ast.Constant. Victor Le jeu. 2 avr. 2020 à 20:15, Guido van Rossum <guido@python.org> a écrit :
-- Night gathers, and now my watch begins. It shall not end until my death.

About the migration, can I ask who is going to (help to) fix projects which rely on the AST?
I think you misunderstood: The AST is exactly the same as the old and the new parser. The only the thing that the new parser does is not generate an immediate CST (Concrete Syntax Tree) and that is only half-exposed in the parser module.

On Thu, Apr 2, 2020 at 2:48 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
If the AST is supposed to be the same, then would it make sense to temporarily – maybe just during the alpha/beta period – always run *both* parsers and confirm that they match? -n -- Nathaniel J. Smith -- https://vorpus.org

On Thu, Apr 2, 2020 at 4:20 PM Nathaniel Smith <njs@pobox.com> wrote:
That's not a bad idea! https://github.com/we-like-parsers/cpython/issues/33 -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Thu, Apr 02, 2020 at 05:17:31PM -0700, Guido van Rossum wrote:
Even just running it in a dev build against the corpus of the top few thousand packages on pypi might give enough confidence -- I had a script to download the top N packages and run some script over the python files contained therein, but I can't seem to find it atm. m -- Matt Billenstein matt@vazor.com http://www.vazor.com/

On Thu, Apr 2, 2020 at 7:55 PM Matt Billenstein <matt@vazor.com> wrote:
We got that. Check https://github.com/gvanrossum/pegen/tree/master/scripts -- look at download_pypi_packages.py and test_pypi_packages.py. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Thu, Apr 02, 2020 at 08:57:30PM -0700, Guido van Rossum wrote:
Very nice! m -- Matt Billenstein matt@vazor.com http://www.vazor.com/

About the migration, can I ask who is going to (help to) fix projects which rely on the AST?
Whoops, I send the latest email before finishing it by mistake. Here is the extended version of the answer: I think there is a misunderstanding here: The new parser generates the same AST as the old parser so calling ast.parse() or compile() will yield exactly the same result. We have extensive testing around that and that was a goal from the beginning. Projects using the ast module will not need to do anything special. The difference is that the new parser does not generate a CST (Concrete Syntax Tree). The concrete syntax tree is an immediate structure from where the AST is generated. This structure is only partially exposed via the "parser" module but otherwise is only used in the parser itself so it should not be a problem. On the other hand: as explained in the PEP, the lack of the CST greatly simplifies the AST generation among other advantages.

On Thu, Apr 2, 2020 at 2:55 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
I think that's only half true. It's true if they already work with Python 3.9 (master/HEAD). But probably some of these packages have not yet started testing with 3.9 nightly runs or even alphas, so it's at least *conceivable* that some of the fixes we applied to the AST could require (small) adjustments. And I think *that* was what Victor was referring to. (For example, I'm not 100% sure that mypy actually works with the latest 3.9. But there seems to be something else wrong there so I can't even test it.)
I just remembered another difference. We haven't really investigated how good the error reporting is. I'm sure there are cases where the syntax error points at a *slightly* different position -- sometimes it's a bit better, sometimes a bit worse. But there could be cases where the PEG parser reads ahead chasing some alternative that will fail much later, and then it would be much worse. We should probably explore this. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Sorry, I was referring to *ambiguous* grammar rules. An extract from the PEP: "Unlike LL(1) parsers PEG-based parsers cannot be ambiguous: if a string parses, it has exactly one valid parse tree. This means that a PEG-based parser cannot suffer from the ambiguity problems described in the previous section." Victor On Fri, Apr 3, 2020 at 2:58 AM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
-- Night gathers, and now my watch begins. It shall not end until my death.

On Fri, Apr 3, 2020 at 2:58 AM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
On Thu, Apr 2, 2020 at 6:15 PM Victor Stinner <vstinner@python.org> wrote:
Maybe we need to rephrase this a bit. It's more that the LL(1) and PEG formalisms deal very differently with ambiguous *grammars*. An example of an ambiguous grammar would be:

    start: X | Y
    X: expr
    Y: expr
    expr: NAME | NAME '+' NAME

There are probably better examples of ambiguous grammars (see https://en.wikipedia.org/wiki/Ambiguous_grammar) but I think this will do to explain the problem. This is a fine context-free grammar (it accepts strings like "a" and "a+b") but the LL(1) formalism will reject it because it sees an overlap in FIRST sets between X and Y -- not surprising because they have the same RHS. Also, even a more powerful formalism would have to make a choice whether to choose X or Y, which may matter if the derivation is used to build a parse tree (like Python's pgen does). OTOH a PEG parser generator will always take the X alternative -- it doesn't care that there's more than one derivation, since its '|' operator is not symmetrical: X|Y and Y|X are not the same, as they are in LL(1) and most other formalisms. (In fact, the common notation for PEG uses '/' to emphasize this, but it looks ugly to me so I changed it to '|'.)

That PEG (by definition) always uses the first matching alternative is actually a blessing as well as a curse. The downside is that PEG can't tell you when you have a real ambiguity in your grammar. But the upside is that it works like a programmer would write a (recursive descent) parser. Thus it "solves" the problem of ambiguous grammars by choosing the first alternative. This allows more freedom in designing a grammar. For example, it would let a language designer solve the "dangling else" problem from the Wikipedia page, by writing the form including the "else" clause first. (Python doesn't have that problem due to the use of indentation, but it might appear in another disguise.)

I should probably refine this argument and include it in the PEP as one of the reasons to prefer PEG over LR or LALR (but I need to think more about that -- it was a very early choice). -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
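As a toy illustration of that asymmetry (hand-written Python, not pegen output): note that to accept "a+b" a PEG version of `expr` has to try the longer alternative first, since `NAME | NAME '+' NAME` would commit to the bare NAME and never reconsider -- another consequence of the asymmetric `|`:

    import re

    NAME = r"[A-Za-z_]\w*"

    def expr(s):
        # PEG-style ordered choice: try NAME '+' NAME before the bare NAME.
        for pattern in (NAME + r"\+" + NAME, NAME):
            if re.fullmatch(pattern, s):
                return ("expr", s)
        return None

    def start(s):
        # start: X | Y -- both alternatives wrap the same expr, so for any
        # input that parses at all, X wins and Y is never reported. A CFG
        # would call this ambiguous; PEG just takes the first alternative.
        tree = expr(s)
        return ("X", tree) if tree else None

    print(start("a"))    # ('X', ('expr', 'a'))
    print(start("a+b"))  # ('X', ('expr', 'a+b'))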

On 3/04/20 3:22 pm, Guido van Rossum wrote:
I'm inclined to think that such problems shouldn't be solved at the parser level, but rather at the language level, i.e. don't design the language that way in the first place. After all, if it's confusing to the computer, it's probably going to be confusing to humans as well. (I note that all of Wirth's languages after Pascal changed the syntax so as not to have a dangling else problem.) Personally I would rather my parser generator *did* complain about ambiguities, so that I can facepalm myself for designing my language in such a stupid way. -- Greg

On 3/04/20 2:13 pm, Victor Stinner wrote:
That paragraph seems rather confused. I think what it *might* be trying to say is that a PEG parser allows you to write productions with overlapping first sets (which would be "ambiguous" for an LL parser), but still somehow guarantees that a unique parse tree is produced. The latter suggests that the grammar as a whole still needs to be unambiguous. -- Greg

We may need to rephrase this to make it a bit more clear, but this is trying to say that PEG grammars cannot be ambiguous in the same sense in which context-free grammars are normally said to be ambiguous. Notice that an ambiguous grammar is normally defined (for instance here: https://en.wikipedia.org/wiki/Ambiguous_grammar) only for context-free grammars, as a grammar with more than one possible parse tree. In the PEG formalism, as Guido explained in the previous email, there is only one possible parse tree because the parser always chooses the first option. As a consequence of this (and as a particular case of it), and as you mention, the PEG formalism allows writing productions with overlapping first sets. Also, notice that first sets are mainly relevant for LL(k) parsers and the like, because those need to *deduce* which alternative to follow given multiple choices in a production, while PEG will always try them in order. In general, the argument is that because of how PEG works, there will only be one parse tree, and this makes the grammar "not ambiguous" under the typical definition of ambiguity for context-free grammars (having multiple parse trees).

On 3/04/20 7:10 am, Guido van Rossum wrote:
Was any consideration given to other types of parser, such as LR or LALR? LR parsers handle left recursion naturally, and don't suffer from any of the drawbacks mentioned in the PEP such as taking exponential time or requiring all the source to be loaded into memory. I think there needs to be a section in the PEP justifying the choice of PEG over the alternatives. -- Greg

On 4/04/20 9:29 am, Brett Cannon wrote:
I think "needs" is a bit strong. It would be nice, though. Regardless, as long as this is a net improvement over the status quo I don't see this being rejected on the grounds that an LR or LALR parser would be better since we have a working PEG parser today. :)
Even if the section only says "We didn't consider any alternatives, because...", I still think it should be there. -- Greg

Thanks, Guido, Pablo, Lysandros, that's a great PEP. Also thanks to everyone else working on the PEG parser over the last year, like Emily. I know it's a lot of work, but as someone who's intimately aware of the headaches caused by the LL(1) parser, I greatly appreciate it :). The only thing I'm missing from the PEP is more detail about how the cross-language nature of the parser actions is handled. The example covers just C, and the description of the actions says they're C expressions. The only mention of Python code generation is for alternatives without actions. Is the intent that the actions are cross-language, or translated to Python somehow, or is the support for generating a Python-based parser merely for debugging, as that action suggests? -- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!

Oh, good point. Thanks for pointing that out. We certainly need to explain that a bit better. The current situation is that actions support both Python and C code. They are basically pieces of code that will be included in the resulting program, no matter what language it is written in. For instance, we use the Python generator to generate the code that parses the grammar for the generator itself. The output is written in Python and the metagrammar uses actions written in Python: https://github.com/we-like-parsers/cpython/blob/pegen/Tools/peg_generator/pe... So regarding the usage of Python code generation: it is certainly useful for debugging, but it is actually used by the generator itself to bootstrap a section of it (the one that parses grammars). The feeling of bootstrapping parsers never gets old and it is one of the most fun parts to do :) I will prepare a PR soon to complement the section about actions in the PEP.

The only thing I'm missing from the PEP is more detail about how the cross-language nature of the parser actions are handled.
Expanded the "actions" section in the PEP here: https://github.com/python/peps/pull/1357

The tl;dr is that actions specified in the grammar are specific to the target language. So if you want to use the pegen tool to generate both Python and C code for the same grammar, you would need two grammar files with the same grammar but different actions. Since our goal here is just to generate a parser for use in CPython that's not a problem. Other PEG parser generators make different choices, e.g. TatSu puts semantics actions in a separate file (https://tatsu.readthedocs.io/en/stable/semantics.html). On Sun, Apr 5, 2020 at 11:06 AM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
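To make that concrete, here is one hypothetical rule written twice, once per target language. The shapes below are illustrative only; the rule and helper names are assumptions, not copied from the real grammar files:

    # Grammar aimed at the C generator: the action is a C expression
    # building a C AST node.
    pass_stmt[stmt_ty]: 'pass' { _Py_Pass(EXTRA) }

    # Grammar aimed at the Python generator: the same rule carries a
    # Python expression producing an ast node instead.
    pass_stmt: 'pass' { ast.Pass() }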

On 6/04/20 2:08 am, Jelle Zijlstra wrote:
And related to that, how precisely will it be able to pinpoint the location of the error? The backtracking worries me a bit in that regard. I can imagine it trying all possible ways to parse the input and then only being able to say "Something is wrong somewhere in this file." -- Greg

Unfortunately they look pretty much the same. We're actually currently trying to improve the error messages for situations where the old parser produces something specialized (mostly because the LL(1) grammar can't express something and the check is done in a later pass).
There's no need to worry about this: in almost all cases the error indicator points to the same spot in the source code as with the old parser. I was worried about this too, but it really doesn't seem to be a problem -- I think this might be different with highly ambiguous grammars, but since Python's grammar is still *mostly* LL(1), it looks like we're fine. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On 6/04/20 4:48 am, Guido van Rossum wrote:
I'm curious about how that works. From the description in the PEP, it seems that none of the individual parsing functions can report an error, because there might be another branch higher up that succeeds. Does it keep track of the maximum distance it got through the source or something like that? -- Greg

On Sun, Apr 5, 2020 at 5:16 PM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I guess you could call it that. There is a small layer of abstraction between the actual tokenizer (which cannot go back) and the generated parser functions. This abstraction buffers tokens. When a parser function wants a token it calls into this abstraction, and that either satisfies it from its buffer, or if there is no lookahead in the buffer left, calls the actual tokenizer. When a parser function fails, it calls into the abstraction layer to back up to a previous point (which I call the "mark"). (A simplified version of this layer is shown in my blog post, https://medium.com/@gvanrossum_83706/building-a-peg-parser-d4869b5958fb -- the class Tokenizer.) When an error bubbles all the way up, we report a SyntaxError pointing to the farthest token that the abstraction has buffered (self.pos in the blog post). -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
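For reference, a condensed sketch of that buffering layer, along the lines of the Tokenizer class in the blog post (details simplified; error bookkeeping omitted):

    class Tokenizer:
        """Buffer tokens from a one-way token stream so that parsing
        functions can back up after a failed alternative."""

        def __init__(self, tokengen):
            self.tokengen = tokengen  # e.g. tokenize.generate_tokens(readline)
            self.tokens = []          # every token pulled so far
            self.pos = 0              # current index into self.tokens

        def get_token(self):
            token = self.peek_token()
            self.pos += 1
            return token

        def peek_token(self):
            if self.pos == len(self.tokens):
                # No lookahead left in the buffer: pull from the real tokenizer.
                self.tokens.append(next(self.tokengen))
            return self.tokens[self.pos]

        def mark(self):
            return self.pos

        def reset(self, pos):
            # Back up to a previous mark. Buffered tokens are kept, so the
            # farthest token ever reached stays available for error reporting.
            self.pos = pos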

The PEP gives a good exposition of the problem and proposed solution, thanks. If I understand correctly, the proposal is that the PEG grammar should become the definitive grammar for Python at some point, probably for Python 3.10, so it may evolve without the LL(1) restrictions. I'd like to raise some points with respect to that, which perhaps the migration section could answer. When definitive, the grammar would not then just be for CPython, and would also appear as user documentation of the language. Whether that change leaves Python with a more useful (readable) grammar seems an important test of the idea. I'm looking at https://github.com/we-like-parsers/cpython/blob/pegen/Grammar/python.gram , and assuming that is indicative of a future definitive grammar. That may be incorrect, as it has these issues in my view:

1. It is decorated with actions in C. If a decorated grammar is offered as definitive, one with Python actions (operations on the AST) is preferable, as implementation neutral, although still hostage to AST changes that are not language changes. Maybe one stripped of actions is best.

2. It's quite long, and not at first glance more readable than the LL(1) grammar. I had understood ugliness in the LL(1) grammar to result from skirting limitations that PEG eliminates. The PEG one is twice as long, but recognising that about half of it is actions, let's just say that as a grammar it's no shorter.

3. There is some manual guidance by means of &-guards, only necessary (I think) as a speed-up or to force out meaningful syntax errors. That would be noise to the reader. (This goes away if the PEG parser generator generated guards from the first set at a simple "no backtracking" marker.)

4. In some places, expansive alternatives seem to be motivated by the difference between actions; for a start, wherever async pops up. Maybe it is also why the definition of lambda is so long. That could go away with different support code (e.g. is_async as an argument), but if improvements to the support change grammar rules when the language has not changed, that's a danger sign too.

All that, I think, means that the "operational" grammar from which you build the parser is going to be quite unlike the one with which you communicate the language. At present ~/Grammar/Grammar both generates the parser (I thought) and appears as documentation. I take it to be the ideal that we use a single, human-readable definition. For example, ANTLR 4 has worked hard to facilitate a grammar in which actions are implicit, and the generation of an AST from the parse tree/events can be elsewhere. (I'm not plugging ANTLR specifically as a solution.)

Jeff Allen

On 02/04/2020 19:10, Guido van Rossum wrote:

On Mon, Apr 6, 2020 at 5:18 AM Jeff Allen <ja.py@farowl.co.uk> wrote:
Thanks, you definitely have a point here.
Yes, the plan is to strip actions and a few other embellishments (types, names, cuts, and probably also lookaheads -- although the latter may be significant, we only use them for optimization). The parser generator ( https://github.com/we-like-parsers/cpython/tree/pegen/Tools/peg_generator) prints a stripped representation (though currently preserving lookaheads -- suppressing those would be a simple change to the code).
Indeed. I believe part of this actually comes from the desire to be 100% compatible with the old parser (an important constraint is that we don't want to change the AST since we don't want to change the byte code generator). Another part of it comes from expressing in the grammar constraints that the old parser generator cannot express. For example, the old parser accepts `1 = x` as an assignment, and it is rejected in a later stage. The new parser expresses this restriction in the grammar. Note that the full grammar published in the reference manual ( https://docs.python.org/3.8/reference/grammar.html) doesn't say anything about this; the grammar used later to describe assignment_stmt does ( https://docs.python.org/3.8/reference/simple_stmts.html#grammar-token-assign...), but as a result it is not LL(1) -- those grammar sections sprinkled throughout the reference manual are all written and updated by hand (and sometimes we forget!).
Yeah, see above. We've thought of generating FIRST sets as a future enhancement of the generator, and then they can go away. At the moment the lookaheads we have are all carefully aimed at optimizing the time and space requirements of the parser.
Yeah, lambda is complicated by the requirement on the generated AST. Arguably we have gone too far here (and for 'parameters', which solves almost the same problem for regular function definitions) and we should put some of the checks back in the support code. But I note that the old grammar also has some warts in the area of parameter definitions (though its lambda is definitely simpler).
Our cheaper solution is to remove the actions from the display grammar. But I don't think that Grammar/Grammar should be seen as a complete specification of the language. And I don't think it is terrible if the specification says

    function_def_raw:
        | ASYNC 'def' NAME '(' parameters? ')' ['->' annotation] ':' block
        | 'def' NAME '(' parameters? ')' ['->' annotation] ':' block

instead of

    function_def_raw: [ASYNC] 'def' NAME '(' parameters? ')' ['->' annotation] ':' block

-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Thu, Apr 2, 2020 at 3:16 PM Guido van Rossum <guido@python.org> wrote:
Hi Guido, I think using a PEG parser is interesting, but I do have some questions related to what to expect in the future for other people who have to follow the Python grammar, so, can you shed some light on this? Does that mean that the grammar format currently available (which is currently specified in https://docs.python.org/3.8/reference/grammar.html) will no longer be updated/used? Is it expected that other language implementations/parsers also have to move to a PEG parser in the future? -- which would probably be the case if the language deviates strongly from LL(1). Thanks, Fabio

On Mon, Apr 6, 2020 at 4:03 AM Fabio Zadrozny <fabiofz@gmail.com> wrote:
The grammar format used for the PEG parser is nearly the same as the old grammar, once you remove actions and some embellishments needed for actions. The biggest difference is that the `|` operator is no longer symmetrical (since if you have alternatives `A | B`, and both match at some point in the input, PEG reports A, while the old generator would reject the grammar as being ambiguous).
We don't specify how other implementations must parse the language -- in fact I have no idea how the parsers of any of the other implementations work. I'm sure there will be other ways to parse the same language. But yeah, if there are implementations that currently closely follow Python's LL(1) parser structure they may have to be changed once we start introducing new syntax that makes use of the freedom PEG gives us. (For example, I've been toying with the idea of introducing a "match" statement similar to Scala's match expression by making "match" a keyword only when followed by an expression and a colon.) -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Mon, Apr 06, 2020 at 10:43:11AM -0700, Guido van Rossum wrote:
Didn't we conclude from `as` that having context-sensitive keywords was a bad idea? Personally, I would not like to have to explain to newcomers why `match` is a keyword but you can still use it as a function or variable, but not other keywords like `raise`, `in`, `def` etc.

    match expression:
        match = True

-- Steven

On Mon, Apr 6, 2020 at 11:36 AM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not sure that that was the conclusion. At the time the point was that we *wanted* all keywords to be reserved everywhere, and `as` was an ugly exception to that rule, which we got rid of as soon as we could -- not because it was a bad idea but because it violated a somewhat arbitrary rule. We went through the same thing with `async` and `await`, and the experience there was worse: a lot of libraries in the very space `async def` was aimed at were using `async` as a parameter name, often in APIs, and they had to scramble to redesign their APIs and get their users to change their programs. In retrospect I wish we had just kept `async` as a context-sensitive keyword, since it was totally doable. (In an early version of the PEG parser, all keywords were context-sensitive, and there were only very few places in the grammar where this required us to insert negative lookaheads to make edge cases parse correctly. The rest was taken care of by careful ordering of rules, e.g. the rule for `del_stmt` must be tried before the rule for `expression_stmt` since `del *x` would match the latter.)
What kind of newcomers do you have that they even notice that, unless you were to draw attention to it? I'm serious -- from the kind of questions I've seen in user forums, most newcomers are having a hard enough time learning more fundamental concepts and abstractions than the precise rules for reserved words. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Tue, Apr 7, 2020 at 5:03 AM Guido van Rossum <guido@python.org> wrote:
From my experience of teaching a variety of languages, including SQL, it's usually not something people have a problem with in toy examples - but it becomes a major nuisance when they're trying to deal with a problem and some keyword is getting in the way. SQL is *full* of context-sensitive keywords, and every once in a while, someone uses a non-reserved word as a column name, and everything works until they run into some specific context where it doesn't work. (It's a bit messier than in Python due to multiple abstraction layers, e.g. ORMs, and sometimes they deal with these issues and sometimes not; but it's still that much harder to debug specifically _because_ things aren't always reserved.)

Ultimately it comes down to the number of edge cases that people have to learn, and how edgy those cases are. Python already has the possibility to override builtins, so you can say "list = []" without an error; context-sensitive keywords sit in a space between those and fully-reserved words. It'll come down to specific words as to whether it's inevitably going to be a problem down the track, or almost certainly going to be fine.

BTW, is the PEG parser going to make it easier to hack on the language syntax? If so, it'd be that much easier to experiment with these kinds of ideas in a separate branch/fork, and quickly find out if there's going to be any major impact. At the moment, editing the grammar is a bit daunting - too many easy ways to mess it up.

ChrisA

On Mon, Apr 6, 2020 at 8:04 PM Guido van Rossum <guido@python.org> wrote:
Absolutely. Beginners can simply be told they are keywords. If they then come across them in other contexts, hopefully there'll be a sensible documentation page, which a web search for "<keyword> keyword" would lead to, explaining that "Some Python keywords can only ever be used with that meaning. Others can be used with other meanings where the context makes it clear that the keyword interpretation does not apply. You are recommended not to use such keywords as names in your own programs. The feature was implemented to make porting existing code to future versions of Python simpler." The tutorial should contain a similar passage.

On Mon, Apr 06, 2020 at 11:54:54AM -0700, Guido van Rossum wrote:
I think, on first glance, I'd rather have all keywords context-sensitive than just some. But I haven't put a great deal of thought into that aspect of it, and I reserve the right to change my mind :-)
It didn't take me 25 years to try using "of" and "if" for "output file" and "input file", so I guess my answer to your question is: ordinary newcomers :-) "Newcomers" doesn't just include beginners to programming; it can include people experienced in one or more other languages coming to Python for the first time.

But if we're talking about complete beginners, the concept of what is and isn't a keyword is not always clear. Why is the first of these legal but not the second? Both words are highlighted in my editor:

    str = "Hello world"
    class = "wizard"

People are going to learn that `match` is a keyword, and then they are going to come across code using it as a variable or method, and while the context-sensitive rule might be obvious to us, it won't be obvious to them precisely because they are still learning the language rules. I think that `match` would be an especially interesting case, because I can easily see someone starting off with a variable `match`, that they handle in an `if` statement, and then as the code evolves they shift it to a `match` statement:

    match match:

and not bother to refactor the name, because they are familiar enough with it that the meaning is obvious.

On the other hand there are definitely a few keywords that collide with useful names. Apart from `if`, I have wanted to use these as variables, parameters or functions: class, raise, in, from, except, while, lambda (off the top of my head, there may be others). There's at least one place in the random module where a parameter is misspelled "lambd" because lambda is a keyword. So there is certainly something to be said for getting rid of keywords.

On the third hand, keywords don't just make it easier for the interpreter, they also make it easier for the human reader. You don't need to care about context: `except` is `except` wherever you see it. That makes it a dead-simple rule for anyone to learn, because there are no exceptions, pun intended. (I guess inside strings and comments are exceptions, but they are well-understood and *simple* exceptions.)

I just can't help feeling at this point that while there are pros and cons to making things a keyword, having some keywords be context sensitive but not others is going to combine the worst of both and end up being confusing and awkward.
That's because the precise rules for reserved words are dead-simple to learn. You can't use them anywhere except in the correct context. If we start adding exceptions to that, so that reserved words are only sometimes reserved, I think that will make them harder to learn. If it's only some reserved words but not others, that's even harder, because we have three classes of words:

* words that are never reserved
* words that are sometimes reserved, depending on what is around them
* words that are always reserved

I had thought that "no context-sensitive keywords" was a hard rule, so I was surprised that you are now re-considering it. -- Steven

After 30 years am I not allowed to take new information into account and consider a change of heart? :-) On Mon, Apr 6, 2020 at 6:21 PM Steven D'Aprano <steve@pearwood.info> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Another point in favour of always-reserved keywords is that they make life a lot easier for syntax highlighters. -- Greg

On 7/04/20 6:54 am, Guido van Rossum wrote:
I don't see it as an arbitrary rule, or at least no more arbitrary than any other language rule. Given that the rule exists, it's the exception that seems arbitrary. There's little justification for it other than "we only thought of using it as a keyword later". To reduce arbitrariness, we would either have to make *all* keywords context-sensitive, or come up with some principled way of deciding whether a given keyword should be reserved or not. -- Greg

On 7/04/20 5:43 am, Guido van Rossum wrote:
I'm still inclined to think that allowing ambiguous grammars is more of a bug than a feature. Is there some way the generator could be made to at least warn if the grammar is genuinely ambiguous (as opposed to just having overlapping first sets in alternatives)?
We don't specify how other implementations must parse the language
And this is one of the reasons. If we use a PEG grammar as the definition of the language, and aren't careful about ambiguities when we add new syntax, we might accidentally end up with something that can *only* be parsed with a PEG parser or something equally powerful.
I'm sure there will be other ways to parse the same language.
That's certainly true now, but can you be sure it will remain true if additions are made that rely on the full power of PEG? -- Greg

After the feedback received at the language summit, we have made a modification to the proposed migration plan in PEP 617, so the new parser will be the default in 3.9 alpha 6: https://github.com/python/peps/pull/1369

The PEP is exciting and is very clearly presented, thank you all for the hard work! Considering the comments in the PEP about the new parser not preserving a parse tree or CST, I have some questions about the future options for Python language-services tooling which requires a CST in order to round-trip and modify Python code. Examples in this space include auto-formatters, refactoring tools, linters with autofix, etc. Today many such tools (e.g. Black, 2to3) are based on lib2to3. Other tools already have their own parser (e.g. LibCST -- which I help maintain -- and Jedi both use parso, a fork of pgen2).

1) 2to3 and lib2to3 are not mentioned in the PEP, but are a documented part of the standard library used by some very popular tools, and currently depend on pgen2. A quick search of the PEP 617 pull request does not suggest that it modifies lib2to3. Will lib2to3 also be removed in Python 3.10 along with the old parser? It might be good for the PEP to address the future of 2to3 and lib2to3 explicitly.

2) As these tools make the necessary adaptations to support Python 3.10, which may no longer be parsable with an LL(1) parser, will we be able to leverage any part of pegen to construct a lossless Python CST, or will we likely need to fork pegen outside of CPython or build a wholly new parser? It would be neat if an alternate grammar could be written in pegen that has access to all tokens (including NL and COMMENT) for this purpose; that would save a lot of code duplication and potential for inconsistency. I haven't had a chance to fully read through the PEP 617 pull request, but it looks like its tokenizer wrapper currently discards NL and COMMENT.

I understand this is a distinct use case with distinct needs and I'm not suggesting that we should make significant sacrifices in the performance or maintainability of pegen to serve it, but if it's possible to enable some sharing by making API choices now before it's merged, that seems worth considering. Carl

On Sat, Apr 18, 2020 at 4:53 PM Carl Meyer <carl@oddbird.net> wrote:
Right, LibCST is very exciting. Note that AFAIK none of the tools you mention depend on the old parser module. (Though I'm not denying that there might be tools depending on it -- that's why we're keeping it until 3.10.)
Note that, while there is indeed a docs page about 2to3 <https://docs.python.org/3/library/2to3.html>, the only docs for *lib2to3* in the standard library reference are a link to the source code and a single "*Note:* The lib2to3 <https://docs.python.org/3/library/2to3.html?highlight=lib2to3#module-lib2to3> API should be considered unstable and may change drastically in the future." Fortunately, in order to support the 2to3 application, lib2to3 doesn't need to change, because the syntax of Python 2 is no longer changing. :-) Choosing to remove 2to3 is an independent decision. And lib2to3 does not depend in any way on the old parser module. (It doesn't even use the standard tokenize module, but incorporates its own version that is slightly tweaked to support Python 2.)
You've mentioned a few different tools that already use different technologies: LibCST depends on parso, which has a fork of pgen2; lib2to3 has the original pgen2. I wonder if this would be an opportunity to move such parsing support out of the standard library completely. There are already two versions of pegen, but neither is in the standard library: there is the original pegen <https://github.com/gvanrossum/pegen/> repo which is where things started, and there is a fork of that code in the CPython Tools <https://github.com/we-like-parsers/cpython/tree/pegen/Tools/peg_generator> directory (not yet in the upstream repo, but see PR 19503 <https://github.com/python/cpython/pull/19503>).

The pegen tool has two generators, one generating C code and one generating Python code. I think that the C generator is really only relevant for CPython itself: it relies on the builtin tokenizer (the one written in C, not the stdlib tokenize.py) and the generated C code depends on many internal APIs. In fact the C generator in the original pegen repo doesn't work with Python 3.9 because those internal APIs are no longer exported. (It also doesn't work with Python 3.7 or older because it makes critical use of the walrus operator. :-) Also, once we started getting serious about replacing the old parser, we worked exclusively on the C generator in the CPython Tools directory, so the version in the original pegen repo is lagging quite a bit behind (as is the Python grammar in that repo). But as I said you're not gonna need it.

On the other hand, the Python generator is designed to be flexible, and while it defaults to using the stdlib tokenize.py tokenizer, you can easily hook up your own. Putting this version in the stdlib would be a mistake, because the code is pretty immature; it is really waiting for a good home, and if parso or LibCST were to decide to incorporate a fork of it and develop it into a high quality parser generator for Python-like languages that would be great. I wouldn't worry much about the duplication of code -- the Python generator in the CPython Tools directory is only used for one purpose, and that is to produce the meta-parser (the parser for grammars) from the meta-grammar. And I would happily stop developing the original pegen once a fork is being developed.

Another option would be to just improve the Python generator in the original pegen repo to satisfy the needs of parso and LibCST. Reading the blurb for parso it looks like it really just parses *Python*, which is less ambitious than pegen. But it also seems to support error recovery, which currently isn't part of pegen. (However, we've thought <https://github.com/we-like-parsers/cpython/issues/84> about it.) Anyway, regardless of how exactly this is structured, someone will probably have to take over development and support. Pegen started out as a hobby project to educate myself about PEG parsers. Then I wrote a bunch of blog posts about my approach, and finally I started working on using it to generate a replacement for the old pgen-based parser. But I never found the time to make it an appealing parser generator tool for other languages, even though that was on my mind as a future possibility. It will take some time to disentangle all this, and I'd be happy to help someone who wants to work on this.

Finally, I should recognize the important influence of my mentor in PEG parsing, Juancarlo Añez <https://github.com/apalala/>. Without his early encouragement and advice I would never have been able to travel this road.
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Sat, Apr 18, 2020 at 10:38 PM Guido van Rossum <guido@python.org> wrote:
Note that, while there is indeed a docs page about 2to3, the only docs for lib2to3 in the standard library reference are a link to the source code and a single "Note: The lib2to3 API should be considered unstable and may change drastically in the future."
Fortunately, in order to support the 2to3 application, lib2to3 doesn't need to change, because the syntax of Python 2 is no longer changing. :-) Choosing to remove 2to3 is an independent decision. And lib2to3 does not depend in any way on the old parser module. (It doesn't even use the standard tokenize module, but incorporates its own version that is slightly tweaked to support Python 2.)
Indeed! Thanks for clarifying, I now recall that I already knew it doesn't, but forgot. The docs page for 2to3 does currently say "lib2to3 could also be adapted to custom applications in which Python code needs to be edited automatically." Perhaps at least this sentence should be removed, and maybe also replaced with a clearer note that lib2to3 not only has an unstable API, but also should not necessarily be expected to continue to parse future Python versions, and thus building tools on top of it should be discouraged rather than recommended. (Maybe even use the word "deprecated.") Happy to submit a PR for this if you agree it's warranted. It still seems to me that it wouldn't hurt for PEP 617 itself to also mention this shift in lib2to3's effective status (from "available but no API stability guarantee" to "probably will not parse future Python versions") as one of its indirect effects.
Thanks, this is all very clarifying! I hadn't even found the original gvanrossum/pegen repo, and was just looking at the CPython PR for PEP 617. Clearly I haven't been following this work closely.
Another option would be to just improve the python generator in the original pegen repo to satisfy the needs of parso and LibCST. Reading the blurb for parso it looks like it really just parses *Python*, which is less ambitious than pegen. But it also seems to support error recovery, which currently isn't part of pegen. (However, we've thought about it.) Anyway, regardless of how exactly this is structured someone will probably have to take over development and support. Pegen started out as a hobby project to educate myself about PEG parsers. Then I wrote a bunch of blog posts about my approach, and finally I started working on using it to generate a replacement for the old pgen-based parser. But I never found the time to make it an appealing parser generator tool for other languages, even though that was on my mind as a future possibility. It will take some time to disentangle all this, and I'd be happy to help someone who wants to work on this.
This seems like the place to start. When we start work on Python 3.10 support for LibCST, we can start with trying to use and adapt pegen in place of the vendored fork of parso we currently use, and if that's promising enough, consider taking over maintenance of it. Carl

Great! Please submit a PR to update the [lib]2to3 docs and CC me (@gvanrossum). While perhaps it wouldn't hurt if the PEP mentioned lib2to3, it was just accepted by the Steering Council without such language, and I wouldn't want to imply that the SC agrees with everything I said. So I still think we ought to deal with lib2to3 independently (and no, it won't need its own PEP :-). A reasonable option would be to just deprecate it and recommend people use parso, LibCST or something else (I wouldn't recommend pegen in its current form yet). On Tue, Apr 21, 2020 at 6:21 PM Carl Meyer <carl@oddbird.net> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Could we go ahead and mark lib2to3 as Pending Deprecation in 3.9 so we can get it out of the stdlib by 3.11 or 3.12? lib2to3 is the basis of all sorts of general source code manipulation tooling. Its name and original raison d'être have moved on. It is actively used to parse and rewrite Python 3 code all the time. yapf uses it, black uses a fork of it. Other Python code manipulation tooling uses it. Modernize-like fixers are useful for all sorts of cleanups. IMNSHO it would be better if lib2to3 were *not* in the stdlib anymore - Black already chose to fork lib2to3 <https://github.com/psf/black/tree/master/blib2to3>. So given that it is eventually not going to be able to parse future syntax, the better answer seems like deprecation: putting the final version up on PyPI and letting any descendants of it live on PyPI, where they can get more active care than a stdlib module ever does. -gps On Tue, Apr 21, 2020 at 6:58 PM Guido van Rossum <guido@python.org> wrote:

On Tue, Apr 21, 2020 at 9:35 PM Gregory P. Smith <greg@krypto.org> wrote:
Could we go ahead and mark lib2to3 as Pending Deprecation in 3.9 so we can get it out of the stdlib by 3.11 or 3.12?
I'm going ahead and tracking the idea in https://bugs.python.org/issue40360.

Hi Guido, Pablo & Lysandros, I'm excited about this improvement to Python, and was interested to hear about it at the language summit as well. I happen to be friends with Alessandro Warth, whom you cited in the PEP as developing the packrat parsing technique you use (at least in part). I wrote to him to ask if he knew he was being cited, and he responded in part with these comments. The additional link may perhaps be useful for you: Alex: (If they had gotten in touch, I would have pointed them at my
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
participants (22)

- Barry Warsaw
- Batuhan Taskaya
- Brett Cannon
- Carl Meyer
- Chris Angelico
- David Mertz
- Fabio Zadrozny
- Greg Ewing
- Gregory P. Smith
- Guido van Rossum
- Ivan Levkivskyi
- Jeff Allen
- Jelle Zijlstra
- Matt Billenstein
- Nam Nguyen
- Nathaniel Smith
- Pablo Galindo Salgado
- Paul Moore
- Steve Holden
- Steven D'Aprano
- Thomas Wouters
- Victor Stinner