<div dir="ltr"><div>Actually CPython has another step between the AST and the bytecode, which validates the AST to block out trees that violate various rules that were not easily incorporated into the LL(1) grammar.  This means that when you want to change parsing, you have to change: the grammar, the AST library, the validation library, and Python's exposed parsing module.</div><div><br></div><div>Modern parsers do not separate the grammar from tokenizing, parsing, and validation.  All of these are done in one place, which not only simplifies changes to the grammar, but also protects you from possible inconsistencies.  It was really hard for me when I was making changes to the parser to keep my conception of these four things synchronized.</div><div><br></div><div>So in my opinion, if you're going to modernize the parsing, then put it all together into one simple library that deals with all of it.  It seems like what you're suggesting would add complexity, whereas a merged solution would simplify the code.  If it's hard to write a fast parser, then consider writing a parser generator in Python that generates the C code you want.<br></div><div><br></div><div>Best,</div><div><br></div><div>Neil</div><div><br>On Friday, June 5, 2015 at 5:30:23 AM UTC-4, Andrew Barnert via Python-ideas wrote:<blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">Compiling a module has four steps:

<br>

<br> * bytes->str (based on encoding declaration or default)

<br> * str->token stream

<br> * token stream->AST

<br> * AST->bytecode
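<br>
<br>A minimal sketch of those four stages using only the standard library. The sample source, the filename sample.py, and the variable names are illustrative and not from the original message. Note that step 3 still starts from the text, since ast.parse does not accept the tokens produced in step 2:

```python
import ast
import io
import tokenize

raw = b"x = 1 + 2\n"  # illustrative module source, as bytes

# 1. bytes to str: detect_encoding looks for a coding declaration,
#    falling back to UTF-8 when there is none
encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
text = raw.decode(encoding)

# 2. str to token stream (the pure-Python tokenizer, not the C one
#    that the compiler itself uses)
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))

# 3. str to AST (the tokens from step 2 cannot be passed in here)
tree = ast.parse(text, filename="sample.py")

# 4. AST to bytecode
code = compile(tree, "sample.py", "exec")

namespace = {}
exec(code, namespace)
```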

<br>

You can very easily hook at every point in that process except the token stream.

<br>

There _is_ a workaround: re-encode the text to bytes, wrap it in a BytesIO, call tokenize, munge the token stream, call untokenize, re-decode back to text, then pass that to compile or ast.parse. But, besides being verbose and painful, that means your line and column numbers get screwed up, because the parser sees the regenerated text rather than the original source. So, while it's fine for a quick-and-dirty toy like my user-literal-hack, it's not something you'd want to do in a real import hook for use in real code.
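<br>
<br>Roughly, that workaround looks like the sketch below. The munge function and its spam-to-eggs rename are hypothetical stand-ins for whatever token rewriting a real hook would do, and parse_with_token_hook is an illustrative name. Because munge yields bare (type, string) pairs, untokenize has to guess the spacing, which is where the original positions are lost:

```python
import ast
import io
import tokenize


def munge(tokens):
    """Hypothetical transform: rename the identifier spam to eggs."""
    for tok in tokens:
        if tok.type == tokenize.NAME and tok.string == "spam":
            yield (tokenize.NAME, "eggs")
        else:
            # Keep only type and string; positions are dropped here
            yield (tok.type, tok.string)


def parse_with_token_hook(text):
    raw = text.encode("utf-8")                            # re-encode the text to bytes
    tokens = tokenize.tokenize(io.BytesIO(raw).readline)  # tokenize wants a bytes readline
    new_raw = tokenize.untokenize(munge(tokens))          # bytes again, due to the ENCODING token
    return ast.parse(new_raw.decode("utf-8"))             # re-decode, then parse


tree = parse_with_token_hook("spam = 1\nprint(spam)\n")
```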

<br>

This could be solved by just changing ast.parse to accept an iterable of tokens or tuples as well as a string, and likewise for compile.

<br>

That isn't exactly a trivial change, because under the covers the _ast module is written in C, partly auto-generated, and expects as input a CST, which is itself created from a different tokenizer written in C with a similar but different API (since C doesn't have iterators). And adding a PyTokenizer_FromIterable or something seems like it might raise some fun bootstrapping issues that I haven't thought through yet. But I think it ought to be doable without having to reimplement the whole parser in pure Python. And I think it would be worth doing.

<br>

<br>While we're at it, a few other (much smaller) changes would be nice:

<br>

<br> * Allow tokenize to take a text file instead of making it take a binary file and repeat the encoding detection.

<br> * Allow tokenize to take a file instead of its readline method.

<br> * Allow tokenize to take a str/bytes instead of requiring a file.

<br> * Add flags to compile to stop at any stage (decoded text, tokens, AST, or bytecode) instead of just the last two.
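<br>
<br>As a rough illustration of the str/bytes item, a thin wrapper can be written today. The helper tokenize_source is hypothetical and not part of the tokenize module. It relies on tokenize.generate_tokens accepting a readline that returns text, which is documented in current Python 3 releases. The last line shows the existing flag for stopping compile at the AST stage:

```python
import ast
import io
import tokenize


def tokenize_source(source):
    """Hypothetical helper: tokenize a str or bytes object directly."""
    if isinstance(source, bytes):
        # bytes input: tokenize.tokenize repeats the encoding detection
        return tokenize.tokenize(io.BytesIO(source).readline)
    # str input: generate_tokens takes a readline that returns text
    return tokenize.generate_tokens(io.StringIO(source).readline)


text_tokens = [tok.string for tok in tokenize_source("a = 1\n")]
byte_tokens = [tok.string for tok in tokenize_source(b"a = 1\n")]

# compile can already stop at the AST stage via ast.PyCF_ONLY_AST
tree = compile("a = 1\n", "sample.py", "exec", ast.PyCF_ONLY_AST)
```

The bytes path yields an extra leading ENCODING token that the str path does not, so callers that mix the two would need to account for that.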

<br> 

<br>(The funny thing is that the C tokenizer actually already does support strings and bytes and file objects.)

<br>

I realize that doing all of these changes would mean that compile can now get an iterable and not know whether it's a file or a token stream until it tries to iterate it. So maybe that isn't the best API; maybe it's better to explicitly call tokenize, then ast.parse, then compile instead of calling compile repeatedly with different flags.


<br></blockquote></div></div>