[Python-ideas] Hooking between lexer and parser
mistersheik at gmail.com
Fri Jun 5 22:40:04 CEST 2015
While we're at it, we can also fix "(1if 0 else 2)" :)
On Friday, June 5, 2015 at 4:38:27 PM UTC-4, Neil Girdhar wrote:
> Actually CPython has another step between the AST and the bytecode, which
> validates the AST to block out trees that violate various rules that were
> not easily incorporated into the LL(1) grammar. This means that when you
> want to change parsing, you have to change: the grammar, the AST library,
> the validation library, and Python's exposed parsing module.
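A small illustration of a rule enforced outside the grammar, offered as a sketch rather than a description of CPython internals. Under the LL(1) grammar of the time, an assignment target was grammatically just an expression, so "cannot assign to a function call" had to be checked after parsing. Whichever stage enforces it in a given version, the visible effect is a SyntaxError. The helper name `is_rejected` is made up for this example.

```python
def is_rejected(source):
    """Return True if compile() refuses `source` with a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return True
    return False


# A call is a perfectly good expression, but not a valid assignment target.
call_target_rejected = is_rejected("f() = 1\n")

# A plain name is accepted as a target.
name_target_rejected = is_rejected("x = 1\n")
```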
> Modern parsers do not separate the grammar from tokenizing, parsing, and
> validation. All of these are done in one place, which not only simplifies
> changes to the grammar, but also protects you from possible
> inconsistencies. It was really hard for me when I was making changes to
> the parser to keep my conception of these four things synchronized.
> So in my opinion, if you're going to modernize the parsing, then put it
> all together into one simple library that deals with all of it. It seems
> like what you're suggesting would add complexity, whereas a merged solution
> would simplify the code. If it's hard to write a fast parser, then
> consider writing a parser generator in Python that generates the C code you
> On Friday, June 5, 2015 at 5:30:23 AM UTC-4, Andrew Barnert via
> Python-ideas wrote:
>> Compiling a module has four steps:
>> * bytes->str (based on encoding declaration or default)
>> * str->token stream
>> * token stream->AST
>> * AST->bytecode
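A rough sketch of the four steps listed above, using only the standard library. The sample source and the names `source_bytes` and `namespace` are invented for illustration. Note that `ast.parse` does not consume the Python-level token list; it tokenizes the text again internally, so the tokens from step 2 are only there to be looked at.

```python
import ast
import io
import tokenize

source_bytes = b"# -*- coding: utf-8 -*-\nx = 1 + 2\n"

# Step 1: bytes -> str, based on the encoding declaration (or the default).
encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
text = source_bytes.decode(encoding)

# Step 2: str -> token stream (the Python-level tokenizer).
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))

# Step 3: text -> AST. This takes the text, not the tokens from step 2.
tree = ast.parse(text)

# Step 4: AST -> bytecode.
code = compile(tree, "<example>", "exec")

namespace = {}
exec(code, namespace)
```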
>> You can very easily hook at every point in that process except the token
>> There _is_ a workaround: re-encode the text to bytes, wrap it in a
>> BytesIO, call tokenize, munge the token stream, call untokenize, re-decode
>> back to text, then pass that to compile or ast.parse. But, besides being a
>> bit verbose and painful, that means your line and column numbers get
>> screwed up. So, while it's fine for a quick&dirty toy like my
>> user-literal-hack, it's not something you'd want to do in a real import
>> hook for use in real code.
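The workaround described above might look roughly like this. It is a minimal sketch, not the code from the user-literal-hack: `munge` and `parse_with_token_hook` are hypothetical names, and the transformation (renaming `spam` to `eggs`) is a stand-in for whatever a real hook would do to the token stream.

```python
import ast
import io
import tokenize


def munge(tokens):
    """Hypothetical token transformation: rename the identifier spam to eggs."""
    for tok in tokens:
        if tok.type == tokenize.NAME and tok.string == "spam":
            yield tok._replace(string="eggs")
        else:
            yield tok


def parse_with_token_hook(text):
    # Re-encode the text to bytes and wrap it in a BytesIO for tokenize.
    data = text.encode("utf-8")
    tokens = tokenize.tokenize(io.BytesIO(data).readline)
    # untokenize returns bytes here because the stream starts with an
    # ENCODING token; decode back to text before parsing.
    new_data = tokenize.untokenize(munge(tokens))
    return ast.parse(new_data.decode("utf-8"))


tree = parse_with_token_hook("spam = 1\n")
```

The positions recorded in the resulting AST refer to the regenerated text, not to the original source, which is the drawback mentioned above.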
>> This could be solved by just changing ast.parse to accept an iterable of
>> tokens or tuples as well as a string, and likewise for compile.
>> That isn't exactly a trivial change, because under the covers the _ast
>> module is written in C, partly auto-generated, and expects as input a CST,
>> which is itself created from a different tokenizer written in C with a
>> similar but different API (since C doesn't have iterators). And adding a
>> PyTokenizer_FromIterable or something seems like it might raise some fun
>> bootstrapping issues that I haven't thought through yet. But I think it
>> ought to be doable without having to reimplement the whole parser in pure
>> Python. And I think it would be worth doing.
>> While we're at it, a few other (much smaller) changes would be nice:
>> * Allow tokenize to take a text file instead of making it take a binary
>> file and repeat the encoding detection.
>> * Allow tokenize to take a file instead of its readline method.
>> * Allow tokenize to take a str/bytes instead of requiring a file.
>> * Add flags to compile to stop at any stage (decoded text, tokens, AST,
>> or bytecode) instead of just the last two.
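For reference, the two stopping points that compile already offers ("the last two" above) can be sketched as follows. `ast.PyCF_ONLY_AST` is the existing flag that stops at the AST; flags for stopping at decoded text or at tokens are what the message proposes, and they are not shown because they do not exist. The sample source and variable names are made up.

```python
import ast

source = "y = 6 * 7\n"

# Stop at the AST stage: compile returns an ast.Module instead of a code object.
tree = compile(source, "<example>", "exec", flags=ast.PyCF_ONLY_AST)

# Continue from the AST to bytecode by passing the tree back to compile.
code = compile(tree, "<example>", "exec")

namespace = {}
exec(code, namespace)
```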
>> (The funny thing is that the C tokenizer actually already does support
>> strings and bytes and file objects.)
>> I realize that doing all of these changes would mean that compile can now
>> get an iterable and not know whether it's a file or a token stream until it
>> tries to iterate it. So maybe that isn't the best API; maybe it's better to
>> explicitly call tokenize, then ast.parse, then compile instead of calling
>> compile repeatedly with different flags.