[Python-ideas] Hooking between lexer and parser

Fri Jun 5 22:40:04 CEST 2015

While we're at it, we can also fix "(1if 0 else 2)" :)

On Friday, June 5, 2015 at 4:38:27 PM UTC-4, Neil Girdhar wrote:
>
> Actually CPython has another step between the AST and the bytecode, which 
> validates the AST to block out trees that violate various rules that were 
> not easily incorporated into the LL(1) grammar.  This means that when you 
> want to change parsing, you have to change: the grammar, the AST library, 
> the validation library, and Python's exposed parsing module.
>
> Modern parsers do not separate the grammar from tokenizing, parsing, and 
> validation.  All of these are done in one place, which not only simplifies 
> changes to the grammar, but also protects you from possible 
> inconsistencies.  It was really hard for me when I was making changes to 
> the parser to keep my conception of these four things synchronized.
>
> So in my opinion, if you're going to modernize the parsing, then put it 
> all together into one simple library that deals with all of it.  It seems 
> like what you're suggesting would add complexity, whereas a merged solution 
> would simplify the code.  If it's hard to write a fast parser, then 
> consider writing a parser generator in Python that generates the C code you 
> want.
>
> Best,
>
> Neil
>
> On Friday, June 5, 2015 at 5:30:23 AM UTC-4, Andrew Barnert via 
> Python-ideas wrote:
>>
>> Compiling a module has four steps: 
>>
>>  * bytes->str (based on encoding declaration or default) 
>>  * str->token stream 
>>  * token stream->AST 
>>  * AST->bytecode 
>>
>> You can very easily hook at every point in that process except the token 
>> stream. 
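
For reference, the four steps can be walked through by hand with the standard library. This is only a sketch: `source_bytes`, the `<example>` filename, and the `namespace` dict are made up for illustration. Note that the token stream in step 2 is a side trip, since `ast.parse` takes the text, not the tokens; that is the gap being described.

```python
import ast
import io
import tokenize

# Illustrative module source; not taken from the thread.
source_bytes = b"# -*- coding: utf-8 -*-\nx = 1 + 2\n"

# Step 1: bytes -> str, honoring the encoding declaration (or the default).
encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
text = source_bytes.decode(encoding)

# Step 2: str -> token stream. Nothing downstream consumes these tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))

# Step 3: text -> AST. ast.parse re-tokenizes internally from the text.
tree = ast.parse(text, filename="<example>")

# Step 4: AST -> bytecode.
code = compile(tree, filename="<example>", mode="exec")

namespace = {}
exec(code, namespace)
```
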
>>
>> There _is_ a workaround: re-encode the text to bytes, wrap it in a 
>> BytesIO, call tokenize, munge the token stream, call untokenize, re-decode 
>> back to text, then pass that to compile or ast.parse. But, besides being a 
>> bit verbose and painful, that means your line and column numbers get 
>> screwed up. So, while it's fine for a quick&dirty toy like my 
>> user-literal-hack, it's not something you'd want to do in a real import 
>> hook for use in real code. 
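
A rough sketch of that workaround, for concreteness. The `munge` transform (renaming `spam` to `eggs`) and the `compile_with_token_hook` name are hypothetical stand-ins, and the sketch assumes the text carries no encoding declaration other than UTF-8.

```python
import io
import tokenize


def munge(tokens):
    """Hypothetical token transform: rename the identifier `spam` to `eggs`."""
    for tok in tokens:
        if tok.type == tokenize.NAME and tok.string == "spam":
            tok = tok._replace(string="eggs")
        yield tok


def compile_with_token_hook(text, filename="<hooked>"):
    # Re-encode the text to bytes and wrap it in a BytesIO for tokenize.
    raw = text.encode("utf-8")
    tokens = tokenize.tokenize(io.BytesIO(raw).readline)
    # Munge the stream, then turn it back into source. untokenize returns
    # bytes here because the stream starts with an ENCODING token.
    new_raw = tokenize.untokenize(munge(tokens))
    # Re-decode to text and hand it to compile, which tokenizes it again.
    new_text = new_raw.decode("utf-8")
    return compile(new_text, filename, "exec")


namespace = {}
exec(compile_with_token_hook("spam = 41\nresult = spam + 1\n"), namespace)
```

The replacement here happens to be the same length as the original name. Once a transform changes token lengths, the positions reported for the compiled code refer to the regenerated text rather than to the source the user wrote, which is the line and column problem mentioned above.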
>>
>> This could be solved by just changing ast.parse to accept an iterable of 
>> tokens or tuples as well as a string, and likewise for compile. 
>>
>> That isn't exactly a trivial change, because under the covers the _ast 
>> module is written in C, partly auto-generated, and expects as input a CST, 
>> which is itself created from a different tokenizer written in C with a 
>> similar but different API (since C doesn't have iterators). And adding a 
>> PyTokenizer_FromIterable or something seems like it might raise some fun 
>> bootstrapping issues that I haven't thought through yet. But I think it 
>> ought to be doable without having to reimplement the whole parser in pure 
>> Python. And I think it would be worth doing. 
>>
>> While we're at it, a few other (much smaller) changes would be nice: 
>>
>>  * Allow tokenize to take a text file instead of making it take a binary 
>> file and repeat the encoding detection. 
>>  * Allow tokenize to take a file instead of its readline method. 
>>  * Allow tokenize to take a str/bytes instead of requiring a file. 
>>  * Add flags to compile to stop at any stage (decoded text, tokens, AST, 
>> or bytecode) instead of just the last two. 
>>   
>> (The funny thing is that the C tokenizer actually already does support 
>> strings and bytes and file objects.) 
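
Until something like that exists, the first three wishes can be approximated with a thin wrapper. `tokenize_any` below is a hypothetical helper, not part of the `tokenize` module; it only dispatches to the existing `tokenize.tokenize` (bytes readline) and `tokenize.generate_tokens` (str readline) entry points.

```python
import io
import tokenize


def tokenize_any(source):
    """Hypothetical wrapper: accept str, bytes, or an open file object."""
    if isinstance(source, str):
        # Text in memory: no encoding detection needed.
        return tokenize.generate_tokens(io.StringIO(source).readline)
    if isinstance(source, (bytes, bytearray)):
        # Bytes in memory: tokenize detects the encoding itself.
        return tokenize.tokenize(io.BytesIO(source).readline)
    if isinstance(source, io.TextIOBase):
        # Text file: already decoded, so skip encoding detection.
        return tokenize.generate_tokens(source.readline)
    # Anything else is assumed to be a binary file-like object.
    return tokenize.tokenize(source.readline)


names = [t.string for t in tokenize_any("a = b + 1\n") if t.type == tokenize.NAME]
```
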
>>
>> I realize that doing all of these changes would mean that compile can now 
>> get an iterable and not know whether it's a file or a token stream until it 
>> tries to iterate it. So maybe that isn't the best API; maybe it's better to 
>> explicitly call tokenize, then ast.parse, then compile instead of calling 
>> compile repeatedly with different flags. 
>>
>