On Sat, Apr 18, 2020 at 4:53 PM Carl Meyer <carl@oddbird.net> wrote:

> The PEP is exciting and is very clearly presented, thank you all for
> the hard work!
>
> Considering the comments in the PEP about the new parser not
> preserving a parse tree or CST, I have some questions about the future
> options for Python language-services tooling which requires a CST in
> order to round-trip and modify Python code. Examples in this space
> include auto-formatters, refactoring tools, linters with autofix, etc.
> Today many such tools (e.g. Black, 2to3) are based on lib2to3. Other
> tools already have their own parser (e.g. LibCST -- which I help
> maintain -- and Jedi both use parso, a fork of pgen2).

Right, LibCST is very exciting. Note that AFAIK none of the tools you mention depend on the old parser module. (Though I'm not denying that there might be tools depending on it -- that's why we're keeping it until 3.10.)

> 1) 2to3 and lib2to3 are not mentioned in the PEP, but are a documented
> part of the standard library used by some very popular tools, and
> currently depend on pgen2. A quick search of the PEP 617 pull request
> does not suggest that it modifies lib2to3. Will lib2to3 also be
> removed in Python 3.10 along with the old parser? It might be good for
> the PEP to address the future of 2to3 and lib2to3 explicitly.

Note that, while there is indeed a docs page about 2to3, the only docs for lib2to3 in the standard library reference are a link to the source code and a single "Note: The lib2to3 API should be considered unstable and may change drastically in the future."

Fortunately, in order to support the 2to3 application, lib2to3 doesn't need to change, because the syntax of Python 2 is no longer changing. :-) Choosing to remove 2to3 is an independent decision. And lib2to3 does not depend in any way on the old parser module. (It doesn't even use the standard tokenize module, but incorporates its own version that is slightly tweaked to support Python 2.)

> 2) As these tools make the necessary adaptations to support Python
> 3.10, which may no longer be parsable with an LL(1) parser, will we be
> able to leverage any part of pegen to construct a lossless Python CST,
> or will we likely need to fork pegen outside of CPython or build a
> wholly new parser? It would be neat if an alternate grammar could be
> written in pegen that has access to all tokens (including NL and
> COMMENT) for this purpose; that would save a lot of code duplication
> and potential for inconsistency. I haven't had a chance to fully read
> through the PEP 617 pull request, but it looks like its tokenizer
> wrapper currently discards NL and COMMENT. I understand this is a
> distinct use case with distinct needs and I'm not suggesting that we
> should make significant sacrifices in the performance or
> maintainability of pegen to serve it, but if it's possible to enable
> some sharing by making API choices now before it's merged, that seems
> worth considering.

You've mentioned a few different tools that already use different technologies: LibCST depends on parso, which has a fork of pgen2, while lib2to3 has the original pgen2. I wonder if this would be an opportunity to move such parsing support out of the standard library completely. There are already two versions of pegen, but neither is in the standard library: there is the original pegen repo, which is where things started, and there is a fork of that code in the CPython Tools directory (not yet in the upstream repo, but see PR 19503).

The pegen tool has two generators, one generating C code and one generating Python code. I think that the C generator is really only relevant for CPython itself: it relies on the builtin tokenizer (the one written in C, not the stdlib tokenize.py) and the generated C code depends on many internal APIs. In fact the C generator in the original pegen repo doesn't work with Python 3.9 because those internal APIs are no longer exported. (It also doesn't work with Python 3.7 or older because it makes critical use of the walrus operator. :-) Also, once we started getting serious about replacing the old parser, we worked exclusively on the C generator in the CPython Tools directory, so the version in the original pegen repo is lagging quite a bit behind (as is the Python grammar in that repo). But as I said you're not gonna need it.
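For readers unfamiliar with why the walrus operator matters here, the following is a minimal sketch of a PEG-style rule method that binds sub-results with `:=` and backtracks on failure. It is a toy written for this note, not pegen's actual code: the class `ToyParser`, its `expect` helper, and the one-rule grammar are all hypothetical. It needs Python 3.8 or newer, which is the same constraint mentioned above.

```python
import io
import tokenize


class ToyParser:
    """Hypothetical toy parser for illustration; not pegen's generated code."""

    def __init__(self, source):
        # Drop tokens this toy grammar does not care about.
        skip = (tokenize.NL, tokenize.COMMENT, tokenize.NEWLINE, tokenize.ENDMARKER)
        readline = io.StringIO(source).readline
        self.tokens = [t for t in tokenize.generate_tokens(readline) if t.type not in skip]
        self.pos = 0

    def expect(self, type_, string=None):
        """Consume and return the next token if it matches, else return None."""
        if self.pos < len(self.tokens):
            tok = self.tokens[self.pos]
            if tok.type == type_ and (string is None or tok.string == string):
                self.pos += 1
                return tok
        return None

    def assignment(self):
        # Toy rule: assignment: NAME '=' NUMBER
        mark = self.pos
        if (
            (name := self.expect(tokenize.NAME))
            and self.expect(tokenize.OP, "=")
            and (value := self.expect(tokenize.NUMBER))
        ):
            return (name.string, value.string)
        self.pos = mark  # backtrack so another alternative could be tried
        return None


print(ToyParser("x = 1\n").assignment())
```

Without `:=`, each sub-match would need its own nested `if` statement, which is why a code base written in this style cannot simply be run on Python 3.7.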

On the other hand, the Python generator is designed to be flexible, and while it defaults to using the stdlib tokenize.py tokenizer, you can easily hook up your own. Putting this version in the stdlib would be a mistake, because the code is pretty immature; it is really waiting for a good home, and if parso or LibCST were to decide to incorporate a fork of it and develop it into a high quality parser generator for Python-like languages that would be great. I wouldn't worry much about the duplication of code -- the Python generator in the CPython Tools directory is only used for one purpose, and that is to produce the meta-parser (the parser for grammars) from the meta-grammar. And I would happily stop developing the original pegen once a fork is being developed.
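To make the NL and COMMENT question concrete, here is a small sketch using only the stdlib tokenize module. It shows that tokenize.py itself emits NL and COMMENT tokens, and what a filtering wrapper that discards them would look like. The function names `all_tokens` and `significant_tokens` are made up for this illustration and are not part of pegen; a CST-oriented fork would presumably feed something like the unfiltered stream to its parser.

```python
import io
import tokenize

SOURCE = "x = 1  # answer\n\ny = 2\n"


def all_tokens(source):
    """Return every token the stdlib tokenizer produces, including NL and COMMENT."""
    return list(tokenize.generate_tokens(io.StringIO(source).readline))


def significant_tokens(source):
    """Hypothetical wrapper: drop NL and COMMENT, as an AST-only parser might."""
    skipped = (tokenize.NL, tokenize.COMMENT)
    return [tok for tok in all_tokens(source) if tok.type not in skipped]


full = all_tokens(SOURCE)
filtered = significant_tokens(SOURCE)

# Print the token type names of each stream for comparison.
print([tokenize.tok_name[tok.type] for tok in full])
print([tokenize.tok_name[tok.type] for tok in filtered])
```

The first list contains a COMMENT and an NL entry that the second lacks. Those are exactly the tokens a tool needs to keep if it wants to reproduce the comment and the blank line when writing the source back out.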

Another option would be to just improve the Python generator in the original pegen repo to satisfy the needs of parso and LibCST. Reading the blurb for parso it looks like it really just parses *Python*, which is less ambitious than pegen. But it also seems to support error recovery, which currently isn't part of pegen. (However, we've thought about it.) Anyway, regardless of how exactly this is structured, someone will probably have to take over development and support. Pegen started out as a hobby project to educate myself about PEG parsers. Then I wrote a bunch of blog posts about my approach, and finally I started working on using it to generate a replacement for the old pgen-based parser. But I never found the time to make it an appealing parser generator tool for other languages, even though that was on my mind as a future possibility. It will take some time to disentangle all this, and I'd be happy to help someone who wants to work on this.

Finally, I should recognize the important influence of my mentor in PEG parsing, Juancarlo Añez. Without his early encouragement and advice I would never have been able to travel this road.

--Guido van Rossum (python.org/~guido)

Pronouns: he/him (why is my pronoun here?)