On Sat, Apr 18, 2020 at 10:38 PM Guido van Rossum <guido@python.org> wrote:
Note that, while there is indeed a docs page about 2to3, the only docs for lib2to3 in the standard library reference are a link to the source code and a single "Note: The lib2to3 API should be considered unstable and may change drastically in the future."
Fortunately, in order to support the 2to3 application, lib2to3 doesn't need to change, because the syntax of Python 2 is no longer changing. :-) Choosing to remove 2to3 is an independent decision. And lib2to3 does not depend in any way on the old parser module. (It doesn't even use the standard tokenize module, but incorporates its own version that is slightly tweaked to support Python 2.)
Indeed! Thanks for clarifying, I now recall that I already knew it doesn't, but forgot. The docs page for 2to3 does currently say "lib2to3 could also be adapted to custom applications in which Python code needs to be edited automatically." Perhaps at least this sentence should be removed, and maybe also replaced with a clearer note that lib2to3 not only has an unstable API, but also should not necessarily be expected to continue to parse future Python versions, and thus building tools on top of it should be discouraged rather than recommended. (Maybe even use the word "deprecated.") Happy to submit a PR for this if you agree it's warranted. It still seems to me that it wouldn't hurt for PEP 617 itself to also mention this shift in lib2to3's effective status (from "available but no API stability guarantee" to "probably will not parse future Python versions") as one of its indirect effects.
You've mentioned a few different tools that already use different technologies: LibCST depends on parso, which has a fork of pgen2, and lib2to3 has the original pgen2. I wonder if this would be an opportunity to move such parsing support out of the standard library completely. There are already two versions of pegen, but neither is in the standard library: there is the original pegen repo which is where things started, and there is a fork of that code in the CPython Tools directory (not yet in the upstream repo, but see PR 19503).
The pegen tool has two generators, one generating C code and one generating Python code. I think that the C generator is really only relevant for CPython itself: it relies on the builtin tokenizer (the one written in C, not the stdlib tokenize.py) and the generated C code depends on many internal APIs. In fact the C generator in the original pegen repo doesn't work with Python 3.9 because those internal APIs are no longer exported. (It also doesn't work with Python 3.7 or older because it makes critical use of the walrus operator. :-) Also, once we started getting serious about replacing the old parser, we worked exclusively on the C generator in the CPython Tools directory, so the version in the original pegen repo is lagging quite a bit behind (as is the Python grammar in that repo). But as I said you're not gonna need it.
On the other hand, the Python generator is designed to be flexible, and while it defaults to using the stdlib tokenize.py tokenizer, you can easily hook up your own. Putting this version in the stdlib would be a mistake, because the code is pretty immature; it is really waiting for a good home, and if parso or LibCST were to decide to incorporate a fork of it and develop it into a high quality parser generator for Python-like languages that would be great. I wouldn't worry much about the duplication of code -- the Python generator in the CPython Tools directory is only used for one purpose, and that is to produce the meta-parser (the parser for grammars) from the meta-grammar. And I would happily stop developing the original pegen once a fork is being developed.
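(An aside for readers of the archive: the "stdlib tokenize.py tokenizer" mentioned above produces a stream of TokenInfo tuples, and "hooking up your own" presumably amounts to supplying a different stream of the same shape. The sketch below uses only the stdlib tokenize module. The helper names token_stream and significant_tokens are hypothetical, and nothing here is pegen's actual API.)

```python
import io
import tokenize


def token_stream(source):
    """Yield TokenInfo tuples for a source string using the stdlib tokenizer."""
    return tokenize.generate_tokens(io.StringIO(source).readline)


def significant_tokens(source):
    """Hypothetical custom stream: drop comments and non-logical newlines.

    A parser could consume this generator in place of the raw stdlib stream.
    """
    skip = {tokenize.COMMENT, tokenize.NL}
    for tok in token_stream(source):
        if tok.type not in skip:
            yield tok


source = "x = 1  # set x\n"
names = [tokenize.tok_name[tok.type] for tok in significant_tokens(source)]
strings = [tok.string for tok in significant_tokens(source)]
print(names)
```

Wrapping the stdlib generator this way keeps the token format unchanged, so a consumer would not need to know whether it is reading the raw or the filtered stream.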
Thanks, this is all very clarifying! I hadn't even found the original gvanrossum/pegen repo, and was just looking at the CPython PR for PEP 617. Clearly I haven't been following this work closely.
Another option would be to just improve the python generator in the original pegen repo to satisfy the needs of parso and LibCST. Reading the blurb for parso it looks like it really just parses *Python*, which is less ambitious than pegen. But it also seems to support error recovery, which currently isn't part of pegen. (However, we've thought about it.) Anyway, regardless of how exactly this is structured someone will probably have to take over development and support. Pegen started out as a hobby project to educate myself about PEG parsers. Then I wrote a bunch of blog posts about my approach, and finally I started working on using it to generate a replacement for the old pgen-based parser. But I never found the time to make it an appealing parser generator tool for other languages, even though that was on my mind as a future possibility. It will take some time to disentangle all this, and I'd be happy to help someone who wants to work on this.
This seems like the place to start. When we start work on Python 3.10 support for LibCST, we can start with trying to use and adapt pegen in place of the vendored fork of parso we currently use, and if that's promising enough, consider taking over maintenance of it.

Carl