[New-bugs-announce] [issue42729] tokenize, ast: No direct way to parse tokens into AST, a gap in the language processing pipeline

Paul Sokolovsky report at bugs.python.org
Thu Dec 24 05:19:55 EST 2020

New submission from Paul Sokolovsky <pfalcon at users.sourceforge.net>:

Currently, it's possible:

* To get from the stream-of-characters program representation to the AST representation (ast.parse()).
* To get from the AST to a code object (compile()).
* To get from a code object to a first-class function in order to execute the program.
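
As an illustration of the three stages above, here is a minimal sketch using only the standard library. The sample source text and the "<example>" filename are made up for this example, and wrapping the module-level code object with types.FunctionType is just one way to get a callable from it (exec() would also work).

```python
import ast
import types

# Made-up sample program, used only for illustration.
source = "def add(a, b):\n    return a + b\n"

# Stage 1: stream of characters -> AST.
tree = ast.parse(source, filename="<example>", mode="exec")

# Stage 2: AST -> code object.
module_code = compile(tree, filename="<example>", mode="exec")

# Stage 3: code object -> first-class function, then call it to
# execute the program. Names defined by the program end up in `namespace`.
namespace = {}
module_func = types.FunctionType(module_code, namespace)
module_func()

result = namespace["add"](2, 3)
print(result)
```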

Python also offers the "tokenize" module, but it stands as a disconnected island: the only thing it allows is getting from the stream-of-characters program representation to a stream of tokens, and back. Conceptually, however, tokenization is not a disconnected feature; it is the first stage of the language processing pipeline. The fact that "tokenize" is disconnected from the rest of the pipeline listed above is more an artifact of the CPython implementation: both the "ast" module and the compile() built-in are backed by the underlying bytecode compiler implementation written in C, and that's what connects them.
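
For reference, the round trip that "tokenize" does support today can be sketched as follows. The sample source line is made up; the check compares token types and strings rather than exact text, since untokenize() does not promise to preserve spacing in every case.

```python
import io
import tokenize

# Made-up sample program, used only for illustration.
source = "x = 1 + 2\n"

# Stream of characters -> stream of tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Stream of tokens -> stream of characters.
restored = tokenize.untokenize(tokens)

# Re-tokenize the restored text and compare (type, string) pairs.
retokenized = list(tokenize.generate_tokens(io.StringIO(restored).readline))
original_pairs = [(tok.type, tok.string) for tok in tokens]
restored_pairs = [(tok.type, tok.string) for tok in retokenized]
```

Nothing in this round trip hands the tokens on to the parser, which is the gap described here.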

The "tokenize" module, on the other hand, is pure Python, while the underlying compiler has its own tokenizer implementation (not exposed). That's the likely reason for the disconnect between "tokenize" and the rest of the infrastructure.

I propose to close that gap and establish an API that would allow parsing a token stream (an iterable) into an AST. An initial implementation for CPython can (and likely should) be naive, making a loop through the surface program representation, i.e. converting the tokens back to source text and parsing that. That's OK; again, the idea is to establish a standard API for going tokens -> AST, after which individual Python implementations can implement or optimize it based on their needs.

The proposed name is ast.parse_tokens(). It follows the signature of the existing ast.parse(), except that the first parameter is "token_stream" instead of "source".
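
A naive version of this could look roughly like the sketch below. Note that parse_tokens() does not exist in the "ast" module; it is written here as a standalone, hypothetical function. Its keyword parameters mirror those of ast.parse() as an assumption about what "follows the signature" would mean, and the sample source line is made up.

```python
import ast
import io
import tokenize


def parse_tokens(token_stream, filename="<unknown>", mode="exec", *,
                 type_comments=False, feature_version=None):
    """Hypothetical sketch of the proposed API, not an existing function.

    Naive approach: turn the tokens back into source text and hand
    that to the regular parser.
    """
    # untokenize() returns str, or bytes if the stream starts with an
    # ENCODING token; ast.parse() accepts either.
    source = tokenize.untokenize(token_stream)
    return ast.parse(source, filename=filename, mode=mode,
                     type_comments=type_comments,
                     feature_version=feature_version)


# Usage: tokenize some source, then go tokens -> AST -> code object.
tokens = tokenize.generate_tokens(io.StringIO("y = 6 * 7\n").readline)
tree = parse_tokens(tokens, filename="<tokens>")

namespace = {}
exec(compile(tree, "<tokens>", "exec"), namespace)
print(namespace["y"])
```

Because the sketch goes back through text, it gains nothing in speed; its only purpose is to show the shape of the API that implementations could later back with a real token-driven parser.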

Another alternative would be to overload the existing ast.parse() to accept a token iterable. I guess that at the current stage, where we are trying to tighten the type strictness of the API and have clear typing signatures for API functions, this is not the favored solution.

components: Library (Lib)
messages: 383680
nosy: BTaskaya, pablogsal, pfalcon, serhiy.storchaka
priority: normal
severity: normal
status: open
title: tokenize, ast: No direct way to parse tokens into AST, a gap in the language processing pipeline
versions: Python 3.10

Python tracker <report at bugs.python.org>
