[Python-ideas] Hooking between lexer and parser
mistersheik at gmail.com
Fri Jun 5 22:38:27 CEST 2015
Actually, CPython has another step between the AST and the bytecode: it
validates the AST to reject trees that violate various rules that could
not easily be expressed in the LL(1) grammar. This means that when you
want to change parsing, you have to change: the grammar, the AST library,
the validation library, and Python's exposed parsing module.
Modern parsers do not separate the grammar from tokenizing, parsing, and
validation. All of these are done in one place, which not only simplifies
changes to the grammar, but also protects you from possible
inconsistencies. It was really hard for me when I was making changes to
the parser to keep my conception of these four things synchronized.
So in my opinion, if you're going to modernize the parsing, then put it all
together into one simple library that deals with all of it. It seems like
what you're suggesting would add complexity, whereas a merged solution
would simplify the code. If it's hard to write a fast parser, then
consider writing a parser generator in Python that generates the C code you
need.
On Friday, June 5, 2015 at 5:30:23 AM UTC-4, Andrew Barnert via Python-ideas wrote:
> Compiling a module has four steps:
> * bytes->str (based on encoding declaration or default)
> * str->token stream
> * token stream->AST
> * AST->bytecode
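Each of those four stages can already be driven individually from the stdlib; a minimal sketch (the source and variable names are just illustrative):

```python
import ast
import io
import tokenize

source_bytes = b"x = 1 + 2\n"

# bytes -> str: detect the encoding declaration (PEP 263), default utf-8
encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
text = source_bytes.decode(encoding)

# str -> token stream: tokenize.tokenize wants a bytes readline
tokens = list(tokenize.tokenize(io.BytesIO(source_bytes).readline))

# str -> AST
tree = ast.parse(text)

# AST -> bytecode
code = compile(tree, "<example>", "exec")
```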
> You can very easily hook at every point in that process except the token
> stream.
> There _is_ a workaround: re-encode the text to bytes, wrap it in a
> BytesIO, call tokenize, munge the token stream, call untokenize, re-decode
> back to text, then pass that to compile or ast.parse. But, besides being a
> bit verbose and painful, that means your line and column numbers get
> screwed up. So, while it's fine for a quick-and-dirty toy like my
> user-literal-hack, it's not something you'd want to do in a real import
> hook for use in real code.
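A minimal sketch of that workaround (the helper name `munge_source` is made up for illustration; `munge` is any function that transforms the token iterable):

```python
import ast
import io
import tokenize

def munge_source(text, munge):
    """Round-trip text through the token stream and re-parse it.

    This is the workaround described above: re-encode, tokenize, munge,
    untokenize, decode, parse.  Positions in the result will not match
    the original source.
    """
    data = text.encode("utf-8")
    toks = tokenize.tokenize(io.BytesIO(data).readline)
    result = tokenize.untokenize(munge(toks))
    # When fed full tokens starting with an ENCODING token, untokenize
    # returns bytes in that encoding.
    if isinstance(result, bytes):
        result = result.decode("utf-8")
    return ast.parse(result)
```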
> This could be solved by just changing ast.parse to accept an iterable of
> tokens or tuples as well as a string, and likewise for compile.
> That isn't exactly a trivial change, because under the covers the _ast
> module is written in C, partly auto-generated, and expects as input a CST,
> which is itself created from a different tokenizer written in C with a
> similar but different API (since C doesn't have iterators). And adding a
> PyTokenizer_FromIterable or something seems like it might raise some fun
> bootstrapping issues that I haven't thought through yet. But I think it
> ought to be doable without having to reimplement the whole parser in pure
> Python. And I think it would be worth doing.
> While we're at it, a few other (much smaller) changes would be nice:
> * Allow tokenize to take a text file instead of making it take a binary
> file and repeat the encoding detection.
> * Allow tokenize to take a file instead of its readline method.
> * Allow tokenize to take a str/bytes instead of requiring a file.
> * Add flags to compile to stop at any stage (decoded text, tokens, AST,
> or bytecode) instead of just the last two.
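For illustration, a wrapper along the lines of the first three bullets (`tokenize_source` is a hypothetical name, not an existing stdlib function):

```python
import io
import tokenize

def tokenize_source(source):
    """Tokenize a str or bytes directly, without hand-wrapping readline."""
    if isinstance(source, str):
        # generate_tokens is the str-based tokenizer; no encoding detection
        return tokenize.generate_tokens(io.StringIO(source).readline)
    # bytes path: tokenize detects the encoding itself
    return tokenize.tokenize(io.BytesIO(source).readline)
```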
> (The funny thing is that the C tokenizer actually already does support
> strings and bytes and file objects.)
> I realize that doing all of these changes would mean that compile can now
> get an iterable and not know whether it's a file or a token stream until it
> tries to iterate it. So maybe that isn't the best API; maybe it's better to
> explicitly call tokenize, then ast.parse, then compile instead of calling
> compile repeatedly with different flags.