Parser module in the stdlib
Hi everyone, TLDR ===== I propose to remove the current parser module and expose pgen2 as a standard library module. Some context =========== The parser module has been "deprecated" (technically we recommend to prefer the ast module instead) since Python2.5 but is still in the standard library. Is a 1222-line C module that needs to be kept in sync with the grammar and the Python parser sometimes by hand. It has also been broken for several years: I recently fixed a bug that was introduced in Python 3.5 that caused the parse module to not being able to parse if-else blocks ( https://bugs.python.org/issue36256). The interface it provides is a very raw view of the CST. For instance:
parser.sequence2st(parser.suite("def f(x,y,z): pass").totuple())
provides an object with methods compile, isexpr, issuite, tolist and totuple. The last two produce containers with the numerical values of the grammar elements (tokens, DFAs, etc.):
parser.suite("def f(x,y,z): pass").tolist()
[257, [269, [295, [263, [1, 'def'], [1, 'f'], [264, [7, '('], [265, [266, [1, 'x']], [12, ','], [266, [1, 'y']], [12, ','], [266, [1, 'z']]], [8, ')']], [11, ':'], [304, [270, [271, [277, [1, 'pass']]], [4, '']]]]]], [4, ''], [0, '']]
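To make the dump above a little more readable: the small numbers are terminal token codes, which can still be decoded with the token module (the larger nonterminal codes such as 257 came from the symbol module, which was removed together with parser in Python 3.10). A sketch of a decoder — the `decode()` helper and the `NT_<n>` fallback naming are invented here for illustration:

```python
import token

# Decode the numeric tuples that parser.st.tolist()/totuple() produce.
# Codes below token.NT_OFFSET (256) are terminal tokens, named via
# token.tok_name; larger codes are grammar nonterminals, looked up
# through the old "symbol" module (removed, along with "parser", in
# Python 3.10), so here they just get an NT_<n> placeholder name.
def decode(node):
    code, *rest = node
    name = token.tok_name.get(code, f"NT_{code}")
    if rest and isinstance(rest[0], str):      # terminal: [code, text]
        return (name, rest[0])
    return (name, [decode(child) for child in rest])

print(decode([1, 'def']))                 # ('NAME', 'def')
print(decode([264, [7, '('], [8, ')']]))  # ('NT_264', [('LPAR', '('), ('RPAR', ')')])
```

Applied to the leaves of the dump above, [1, 'def'] is a NAME token, [11, ':'] a COLON, and so on.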
This is a very raw interface, very tied to the particularities of CPython and with almost no abstraction.

On the other hand, there is a Python implementation of the Python parser and parser generator in lib2to3 (pgen2). This module is not documented and is usually considered an implementation detail of lib2to3, but it is extremely useful. Several 3rd-party packages (black, fissix...) are using it directly or have their own forks, because it can get outdated with respect to the Python 3 grammar (it was originally used only for the Python 2 to Python 3 migration). It has the ability to consume LL(1) grammars in EBNF form and produce an LL(1) parser for them (by creating parser tables that the same module can consume). Many people currently use the module to support or parse supersets of Python (like Python 2/3-compatible grammars, cython-like changes...etc).

Proposition
========

I propose to finally remove the parser module, as it has been "deprecated" for a long time, it is almost certain that nobody uses it, and it has very limited usability; and to replace it (maybe with a different name) with pgen2 (maybe with a more generic interface that is detached from the particularities of lib2to3). This will not only help the current libraries that are using forks or similar solutions, but will also help to keep the shipped grammar (which is able to parse Python 2 and Python 3 code) synchronized with the current Python one (as there will now be more justification to keep them in sync).

What do people think about this? Do you have any concerns? Do you think it is a good/bad idea?

Regards from sunny London,
Pablo Galindo
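The "consume a grammar, produce a parser" idea described above can be sketched without depending on pgen2's actual API. The toy below accepts a grammar as plain data and parses by backtracking recursive descent; all names (make_parser, the rule format) are invented for illustration — pgen2's real input is Python's EBNF Grammar file and its output is LL(1) parse tables.

```python
# Toy sketch of pgen2's core idea: feed in a grammar, get back a parser.
# Each rule maps to a list of alternatives; each alternative is a
# sequence of terminals or rule names. Parsing is done by backtracking
# recursive descent (pgen2 instead builds LL(1) tables up front).

def make_parser(rules, start):
    def match(rule, toks, i):
        for alt in rules[rule]:
            j, kids, ok = i, [], True
            for sym in alt:
                if sym in rules:                        # nonterminal
                    node, j2 = match(sym, toks, j)
                    if node is None:
                        ok = False
                        break
                    kids.append(node)
                    j = j2
                elif j < len(toks) and toks[j] == sym:  # terminal
                    kids.append(sym)
                    j += 1
                else:
                    ok = False
                    break
            if ok:
                return (rule, kids), j
        return None, i                                  # backtrack

    def parse(tokens):
        node, consumed = match(start, tokens, 0)
        if node is None or consumed != len(tokens):
            raise SyntaxError("no parse")
        return node

    return parse

# expr: 'a' '+' expr | 'a'   (right-recursive toy grammar)
parse = make_parser({"expr": [["a", "+", "expr"], ["a"]]}, "expr")
print(parse(["a", "+", "a"]))   # ('expr', ['a', '+', ('expr', ['a'])])
```

The parse trees it returns have the same shape as pgen2's output: a nonterminal name with a list of children.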
On Thu, 16 May 2019 at 23:15, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
The parser module has been "deprecated" (technically we recommend to prefer the ast module instead) since Python 2.5 but is still in the standard library.
Importing it doesn't emit a DeprecationWarning. It's only deprecated in the documentation: https://docs.python.org/dev/library/parser.html

I searched for "import parser" in Python code on GitHub. I looked at the 10 pages of results: I cannot find any code which looks to import "parser" from the Python standard library. Only an "import parser" in a test suite of PyPy, something unrelated.
parser.suite("def f(x,y,z): pass").tolist()
[257, [269, [295, [263, [1, 'def'], [1, 'f'], [264, [7, '('], [265, [266, [1, 'x']], [12, ','], [266, [1, 'y']], [12, ','], [266, [1, 'z']]], [8, ')']], [11, ':'], [304, [270, [271, [277, [1, 'pass']]], [4, '']]]]]], [4, ''], [0, '']]
I don't understand how anyone can read or use a Concrete Syntax Tree (CST) directly :-( I would expect that you need another module on top of that to use a CST. But I'm not aware of anything like that in the stdlib, so I understand that... nobody uses this "parser" module. I never heard about the "parser" module in the stdlib before this email :-)
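For contrast, the ast module that the documentation recommends over parser returns named nodes instead of numeric codes — a quick sketch on the same `def f(x, y, z)` example from earlier in the thread:

```python
# The ast module's view of the source that produced the raw numeric
# CST dump above: named node types and named fields instead of codes.
import ast

tree = ast.parse("def f(x, y, z): pass")
func = tree.body[0]
print(type(func).__name__)                 # 'FunctionDef'
print([a.arg for a in func.args.args])     # ['x', 'y', 'z']
```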
Several 3rd party packages (black, fissix...) are using it directly or have their own forks due to the fact that it can get outdated with respect to the Python3 grammar as it was originally used only for Python2 to Python3 migration. It has the ability to consume LL1 grammars in EBNF form and produces an LL1 parser for them (by creating parser tables that the same module can consume). Many people use the module currently to support or parse supersets of Python (like Python2/3 compatible grammars, cython-like changes...etc).
* Did someone review the pgen2 implementation?
* Does it have a good API?
* Does it come with documentation?
* Who is maintaining it? Who will maintain it?
I propose to remove finally the parser module as it has been "deprecated" for a long time, is almost clear that nobody uses it and has very limited usability and replace it (maybe with a different name) with pgen2 (maybe with a more generic interface that is detached to lib2to3 particularities).
I'm not sure that these two things have to be done at the same time. Maybe we can start by issuing a DeprecationWarning on "import parser" and announce the future removal in What's New in Python 3.8, just in case, and wait one more release before removing it. But I'm also fine with removing it right now. As I wrote, I never heard about this module previously...

If your proposal is to reuse the "parser" name, I'm not sure that it's a good idea, since it existed for years and had a different API. Why not keep the "pgen2" name? You said that it's already used under this name in the wild.

Note: one thing to consider is that the https://pypi.org/project/pgen2/ name is already taken. It might conflict if the "new" pgen2 (pgen3? :-D) had a slightly different API.

Victor

-- Night gathers, and now my watch begins. It shall not end until my death.
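Issuing a DeprecationWarning on "import parser" amounts to a warnings.warn call at the top of the module body. A sketch only — the helper name below stands in for the module body and is not the actual patch that was later applied:

```python
# Sketch of a module-level DeprecationWarning, as proposed for
# "import parser". The function stands in for the top of the module
# body; its name is invented for illustration.
import warnings

def _parser_module_body():
    warnings.warn(
        "the parser module is deprecated and will be removed in a "
        "future version of Python; use the ast module instead",
        DeprecationWarning,
        stacklevel=2,
    )

# Demonstrate that the warning fires on (simulated) import:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _parser_module_body()

print(caught[0].category.__name__)   # 'DeprecationWarning'
```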
On Thu, May 16, 2019 at 2:13 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
I propose to remove finally the parser module as it has been "deprecated" for a long time, is almost clear that nobody uses it and has very limited usability and replace it (maybe with a different name) with pgen2 (maybe with a more generic interface that is detached to lib2to3 particularities). This will not only help a lot current libraries that are using forks or similar solutions but also will help to keep synchronized the shipped grammar (that is able to parse Python2 and Python3 code) with the current Python one (as now will be more justified to keep them in sync).
Will the folks using forks be happy to switch to the stdlib version? For example I can imagine that if black wants to process 3.7 input code while running on 3.6, it might prefer a parser on PyPI even if the stdlib version were public, since the PyPI version can be updated independently of the host Python. -n -- Nathaniel J. Smith -- https://vorpus.org
Will the folks using forks be happy to switch to the stdlib version? For example I can imagine that if black wants to process 3.7 input code while running on 3.6, it might prefer a parser on PyPI even if the stdlib version were public, since the PyPI version can be updated independently of the host Python.

The tool can parse arbitrary grammars; the one that it ships with is just one of them.
I think it would be useful, among other things because the standard library currently lacks a proper CST solution. The ast module is heavily leveraged for things like formatters, static code analyzers, etc., but a CST can be very useful, as Łukasz describes here: https://bugs.python.org/issue33337

I think this fills an important gap in the stdlib, and the closest thing we have (the current parser module) is not useful for any of that. Also, the core needed to generate the hypothetical new package (with some new API over it, maybe) already exists as an undocumented implementation detail of lib2to3 (and some people are already using it directly).
On 16May2019 1548, Pablo Galindo Salgado wrote:
Will the folks using forks be happy to switch to the stdlib version? For example I can imagine that if black wants to process 3.7 input code while running on 3.6, it might prefer a parser on PyPI even if the stdlib version were public, since the PyPI version can be updated independently of the host Python. The tool can parse arbitrary grammars; the one that it ships with is just one of them.
I think it would be useful, among other things because the standard library lacks currently a proper CST solution. The ast module is heavily leveraged for things like formatters, static code analyzers...etc but CST can be very useful as Łukasz describes here:
https://bugs.python.org/issue33337
I think is missing an important gap in the stdlib and the closest thing we have (the current parser module) is not useful for any of that. Also, the core to generating the hypothetical new package (with some new API over it may be) is already undocumented as an implementation detail of lib2to3 (and some people are already using it directly).
We still have the policy of not removing modules that exist in the Python 2 standard library. But 3.9 won't be covered by that :) But I'm in favor of having a proper CST module that matches the version of Python it's in. It doesn't help people on earlier versions (yet), but given how closely tied it is to the Python version you're on I think it makes sense in the stdlib. Cheers, Steve
On Thu., May 16, 2019, 15:56 Steve Dower, <steve.dower@python.org> wrote:

On 16May2019 1548, Pablo Galindo Salgado wrote:

Will the folks using forks be happy to switch to the stdlib version? For example I can imagine that if black wants to process 3.7 input code while running on 3.6, it might prefer a parser on PyPI even if the stdlib version were public, since the PyPI version can be updated independently of the host Python. The tool can parse arbitrary grammars; the one that it ships with is just one of them.

I think it would be useful, among other things because the standard library lacks currently a proper CST solution. The ast module is heavily leveraged for things like formatters, static code analyzers...etc but CST can be very useful as Łukasz describes here:
https://bugs.python.org/issue33337
I think is missing an important gap in the stdlib and the closest thing we have (the current parser module) is not useful for any of that. Also, the core to generating the hypothetical new package (with some new API over it may be) is already undocumented as an implementation detail of lib2to3 (and some people are already using it directly).
We still have the policy of not removing modules that exist in the Python 2 standard library. But 3.9 won't be covered by that :)
Correct. 😁 We should deprecate in 3.8 for removal in 3.9.
But I'm in favor of having a proper CST module that matches the version of Python it's in. It doesn't help people on earlier versions (yet), but given how closely tied it is to the Python version you're on I think it makes sense in the stdlib.
+1. I think someone brought up the API, and probably talking to projects that are using pgen2 out of lib2to3 would be good. -Brett
Cheers, Steve _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/brett%40python.org
On Thu, May 16, 2019 at 3:57 PM Steve Dower <steve.dower@python.org> wrote:
[...] We still have the policy of not removing modules that exist in the Python 2 standard library. But 3.9 won't be covered by that :)
I didn't even remember that. Where's that written down? And by the time 3.8.0 (final) comes out, 2.7 has only about two months of life left...

FWIW I am strongly in favor of getting rid of the `parser` module, in 3.8 if we can, otherwise in 3.9 (after strong deprecation in 3.8). I am interested in switching CPython's parsing strategy to something else (what exactly remains to be seen) and any new approach is unlikely to reuse the current CST technology. (OTOH I think it would be wise to keep the current AST.)
But I'm in favor of having a proper CST module that matches the version of Python it's in. It doesn't help people on earlier versions (yet), but given how closely tied it is to the Python version you're on I think it makes sense in the stdlib.
I presume this is in reference to Łukasz's https://bugs.python.org/issue33337. I think we should act on that issue, but I don't think there's a good reason to tie deletion (or deprecation) of the `parser` module to whatever we do there. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him/his **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
On Mon, May 20, 2019 at 11:29 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, May 20, 2019 at 08:55:59AM -0700, Guido van Rossum wrote:
I am interested in switching CPython's parsing strategy to something else (what exactly remains to be seen)
Are you re-thinking the restriction to LL(1) grammars?
Indeed. I think it served us well for the first 10-15 years, but now it is just a burden. See a thread I started at Discourse: https://discuss.python.org/t/switch-pythons-parsing-tech-to-something-more-p... And some followup there: https://discuss.python.org/t/preparing-for-new-python-parsing/1550 Note that this is very much speculative. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him/his **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
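A tiny sketch of why moving beyond LL(1) matters: with one token of lookahead a parser cannot choose between alternatives that share a prefix, while a PEG-style parser simply tries them in order and backtracks. The toy rule below (`stmt: NAME '=' NAME | NAME`) is invented for illustration and is not Python's actual grammar:

```python
# Both alternatives of the toy rule start with NAME, so an LL(1)
# parser looking only at the first token ("x") cannot pick one.
# A PEG parser uses ordered choice: try the first alternative,
# backtrack on failure, try the next.

def parse_assign(toks):
    if len(toks) >= 3 and toks[1] == "=":
        return ("assign", toks[0], toks[2]), toks[3:]
    return None, toks                  # fail without consuming input

def parse_name(toks):
    if toks:
        return ("expr", toks[0]), toks[1:]
    return None, toks

def parse_stmt(toks):
    for alternative in (parse_assign, parse_name):   # ordered choice
        node, rest = alternative(toks)
        if node is not None:
            return node, rest
    raise SyntaxError("no parse")

print(parse_stmt(["x", "=", "y"])[0])   # ('assign', 'x', 'y')
print(parse_stmt(["x"])[0])             # ('expr', 'x')
```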
On 5/20/2019 11:55 AM, Guido van Rossum wrote:
On Thu, May 16, 2019 at 3:57 PM Steve Dower <steve.dower@python.org <mailto:steve.dower@python.org>> wrote:
[...] We still have the policy of not removing modules that exist in the Python 2 standard library. But 3.9 won't be covered by that :)
I didn't even remember that. Where's that written down?
AFAIK, there has been no BDFL pronouncement or coredev vote. But it has become somewhat of a consensus as a result of discussions in multiple threads about removing modules going back to the 3.4 era. PEP 594, just posted, proposes to codify a 'soft' (or perhaps 'slow') version of the policy: doc deprecation, code deprecation, and code removal in 3.8, 3.9, 3.10.
FWIW I am strongly in favor of getting rid of the `parser` module, in 3.8 if we can, otherwise in 3.9 (after strong deprecation in 3.8).
You could request that 'parser' be specifically excluded from the PEP, along with 'distutils' (and effectively anything else not specifically named). Or request that it be included, but with an accelerated schedule. I have a vague idea of why you think it harmful to keep it around, but a reminder would not hurt ;-). -- Terry Jan Reedy
On Mon., May 20, 2019, 16:00 Terry Reedy, <tjreedy@udel.edu> wrote:
On 5/20/2019 11:55 AM, Guido van Rossum wrote:
On Thu, May 16, 2019 at 3:57 PM Steve Dower <steve.dower@python.org <mailto:steve.dower@python.org>> wrote:
[...] We still have the policy of not removing modules that exist in the Python 2 standard library. But 3.9 won't be covered by that :)
I didn't even remember that. Where's that written down?
AFAIK, there has been no BDFL pronouncement or coredev vote.
It's in PEP 4 since at least Python 3.5: https://www.python.org/dev/peps/pep-0004/#for-modules-existing-in-both-pytho...

-Brett
On Thu, May 16, 2019 at 3:51 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
[Nathaniel Smith]
Will the folks using forks be happy to switch to the stdlib version?
For example I can imagine that if black wants to process 3.7 input code while running on 3.6, it might prefer a parser on PyPI even if the stdlib version were public, since the PyPI version can be updated independently of the host Python.
The tool can parse arbitrary grammars; the one that it ships with is just one of them.
I think it would be useful, among other things because the standard library lacks currently a proper CST solution. The ast module is heavily leveraged for things like formatters,
Actually, I think the `ast` module doesn't work very well for formatters, because it loses comments. (Retaining comments and all details of whitespace is the specific use case for which I created pgen2.)
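This point is easy to demonstrate: ast.parse drops comments entirely, while the tokenize module still sees them as COMMENT tokens — which is the gap a CST (or token-level) layer fills for formatters:

```python
# ast drops comments; tokenize keeps them.
import ast
import io
import tokenize

src = "x = 1  # keep me\n"

tree = ast.parse(src)
print("keep me" in ast.dump(tree))   # False: the comment is gone

tokens = tokenize.generate_tokens(io.StringIO(src).readline)
comments = [t.string for t in tokens if t.type == tokenize.COMMENT]
print(comments)                      # ['# keep me']
```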
static code analyzers...etc but CST can be very useful as Łukasz describes here:
https://bugs.python.org/issue33337
I think is missing an important gap in the stdlib and the closest thing we have (the current parser module) is not useful for any of that. Also, the core to generating the hypothetical new package (with some new API over it may be) is already undocumented as an implementation detail of lib2to3 (and some people are already using it directly).
I wonder if lib2to3 is actually something that would benefit from moving out of the stdlib. (Wasn't it on Amber's list?) As Łukasz points out in that issue, it is outdated. Maybe if it were out of the stdlib it would attract more contributors. Then again, I have recently started exploring the idea of a PEG parser for Python. Maybe a revamped version of the core of lib2to3 based on PEG instead of pgen would be interesting to some folks.

I do agree that the two versions of tokenize.py should be unified (and the result kept in the stdlib). However, there are some issues here, because tokenize.py is closely tied to the names and numbers of the tokens, and the latter information is currently generated based on the contents of the Grammar file. This may get in the way of using it to tokenize old Python versions.

-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him/his **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
Actually, I think the `ast` module doesn't work very well for formatters, because it loses comments. (Retaining comments and all details of whitespace is the specific use case for which I created pgen2.)

Some uses I have seen include using it to check that the code before and after the formatting has no functional changes (both have the same ast) or to augment the information obtained with other sources. But yeah, I agree that static code analyzers and linters are a much bigger target.

I wonder if lib2to3 is actually something that would benefit from moving out of the stdlib. (Wasn't it on Amber's list?) As Łukasz points out in that issue, it is outdated. Maybe if it was out of the stdlib it would attract more contributors. Then again, I have recently started exploring the idea of a PEG parser for Python. Maybe a revamped version of the core of lib2to3 based on PEG instead of pgen would be interesting to some folks.

I was thinking more along the lines of leveraging some parts of lib2to3 to have a CST-related solution similar to the ast module, not exposing the whole functionality of lib2to3. Basically, it would be a more high-level abstraction to substitute the current parser module. Technically you should be able to reconstruct some primitives that lib2to3 uses on top of the output that the parser module generates (modulo some extra information from the grammar), but the raw output that the parser module generates is not super useful by itself, especially when you consider the maintenance costs. On the other side, as you mention here:

I am interested in switching CPython's parsing strategy to something else (what exactly remains to be seen) and any new approach is unlikely to reuse the current CST technology. (OTOH I think it would be wise to keep the current AST.)

it is true that changing the parser can greatly influence the hypothetical CST module, so it may complicate the conversion to a new parser solution if the API does not abstract enough (or it may be close to impractical depending on the new parser solution). My original suggestion was based on the fact that the parser module is not super useful and has a great maintenance cost, but the "realm" of what it solves (providing access to the parse trees) could be useful for some use cases, so that is why I was talking about "parser" and lib2to3 in the same email.
Perhaps we can be more productive if we focus on just deprecating the "parser" module, but I thought it was an opportunity to solve two (related) problems at once.
On Fri, 17 May 2019 at 00:44, Nathaniel Smith <njs@pobox.com> wrote:
Will the folks using forks be happy to switch to the stdlib version? For example I can imagine that if black wants to process 3.7 input code while running on 3.6, it might prefer a parser on PyPI even if the stdlib version were public, since the PyPI version can be updated independently of the host Python.
Pablo wrote:
I think is missing an important gap in the stdlib and the closest thing we have (the current parser module) is not useful for any of that. Also, the core to generating the hypothetical new package (with some new API over it may be) is already undocumented as an implementation detail of lib2to3 (and some people are already using it directly).
IMHO it's way better to first mature a module on PyPI to ensure that it:

* is battle-tested
* has a well-defined and *stable* API
* has a good test suite

Right now, everything looks too vague to me. Reminder: the stdlib basically only gets a release every 2 years. If you forget a super important feature, you have to wait 2 years to add it. Even for bugfixes, you also have to wait months if not years until all vendors upgrade their Python package... I'm not sure that putting it directly in the stdlib is a good idea.

Victor

-- Night gathers, and now my watch begins. It shall not end until my death.
On 16/05/2019 23.12, Pablo Galindo Salgado wrote:
Hi everyone,
TLDR =====
I propose to remove the current parser module and expose pgen2 as a standard library module.
I'd like to add this to PEP 594, see https://github.com/python/peps/pull/1063

Terry, thanks for connecting my PEP with Pablo's proposal.
On 16/05/2019 at 23:12, Pablo Galindo Salgado wrote:
I propose to remove the current parser module and expose pgen2 as a standard library module.
As a first step, I created https://bugs.python.org/issue37268 to propose to deprecate the parser module already in Python 3.8. Victor
participants (9)

- Brett Cannon
- Christian Heimes
- Guido van Rossum
- Nathaniel Smith
- Pablo Galindo Salgado
- Steve Dower
- Steven D'Aprano
- Terry Reedy
- Victor Stinner