On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
> On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik@gmail.com> wrote:
>>
>> Actually CPython has another step between the AST and the bytecode, which validates the AST to reject trees that violate various rules that were not easily incorporated into the LL(1) grammar.
>
> Yes, and it also builds CST nodes before building the AST, and there's a step after AST validation and before bytecode generation where the symbol table for the scope is built and LEG rules are applied. But none of those are things that seem particularly useful to hook. Maybe that's just a failure of imagination, but I've never wanted to do it.
>
> Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook, because an import hook can't get access to the token stream.
>
> And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; it would be a whole lot harder without that, while it would be a little easier and a whole lot cleaner if I could just hook the token stream.

Yes, I think I understand your motivation. Can you help me understand what the hook you would write would look like?
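To make the question concrete, here is my guess at roughly what your workaround looks like at the token level, using only the stdlib tokenize module (the "d" suffix for Decimal is just an example; correct me if I have the shape wrong):

    import io
    import tokenize
    from tokenize import NAME, NUMBER, OP, STRING

    def literal_d(tokens):
        # Token-level hook: rewrite NUMBER followed by NAME "d" into
        # Decimal('<number>'), e.g. 1.1d -> Decimal('1.1').
        pending = None
        for tok in tokens:
            if pending is not None and tok.type == NAME and tok.string == "d":
                yield (NAME, "Decimal")
                yield (OP, "(")
                yield (STRING, repr(pending.string))
                yield (OP, ")")
                pending = None
                continue
            if pending is not None:
                yield pending
                pending = None
            if tok.type == NUMBER:
                pending = tok
            else:
                yield tok
        if pending is not None:
            yield pending

    def hook_source(source_bytes):
        # The workaround: tokenize, transform, then untokenize back to
        # *text*, because an import hook can only hand the compiler text.
        toks = tokenize.tokenize(io.BytesIO(source_bytes).readline)
        return tokenize.untokenize(literal_d(toks))

And the proposal, then, is that a hook could hand literal_d(toks) straight to the compiler and skip the untokenize round trip?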
>> This means that when you want to change parsing, you have to change: the grammar, the AST library, the validation library, and Python's exposed parsing module.
>
> And the code generation and other post-validation steps (unless you're just trying to create a do-nothing construct--which can be useful to give you something new to, e.g., feed to MacroPy, but it's not something you're going to check in to core, or use in your own production code).
>
> So yes, changing the grammar is painful. Which is just one more reason that being able to hack on Python without having to hack on Python is very useful. And as of 3.4, all of the pieces are there to do that, and dead-easy to use, and robust enough for production code--as long as the level you want to hack on is source text, AST, or bytecode, not token stream.

Yes.
>> Modern parsers do not separate the grammar from tokenizing, parsing, and validation. All of these are done in one place, which not only simplifies changes to the grammar, but also protects you from possible inconsistencies. It was really hard for me when I was making changes to the parser to keep my conception of these four things synchronized.
>>
>> So in my opinion, if you're going to modernize the parsing, then put it all together into one simple library that deals with all of it. It seems like what you're suggesting would add complexity, whereas a merged solution would simplify the code.
>
> Rewriting the entire parsing mechanism from scratch might simplify things, but it also means rewriting the entire parsing mechanism from scratch. I'm sure you could implement a GLR parser generator that takes a complete declarative grammar and generates something that goes right from source code to a SAX- or iterparse-style pre-validated AST, and that would be a really cool thing. But besides being a lot of work, it would also carry a huge amount of risk. You'd almost certainly end up with new bugs, new places where syntax errors are harder to diagnose, new places where compiling is slower than it used to be, etc.
>
> Also, Python is defined as having a separate lexical analysis phase, and a module named tokenize that does the same thing as this phase, and tests that test it, and so on. So you can't just throw all that out and remain backward compatible.

I don't see why that is. The lexical "phase" would just become new parsing rules, and so it would be supplanted by the parser. Then you wouldn't need to add special hooks for lexing; you would merely have hooks for parsing.
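As a toy illustration of what I mean: in a scannerless parser, a lexical rule like "name" is just one more production over characters, so there is no separate lexer left to hook:

    def parse_name(src, i):
        """name : (letter | '_') (letter | digit | '_')*  -- a lexical
        rule expressed as an ordinary parsing function over characters."""
        j = i
        if j < len(src) and (src[j].isalpha() or src[j] == "_"):
            j += 1
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            return src[i:j], j
        raise SyntaxError("expected a name at position %d" % i)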
>
> Meanwhile, adding functions to create a token state struct out of a Python iterable, drive it, and expose that functionality to Python is a lot less work, very unlikely to have any effect on the existing default mechanism (if you don't hook the token stream, the existing code runs the same as today, except for an if check inside the next-token function), and much easier to make exactly compatible with existing behavior even when you do hook the token stream (if creating and driving the token state works, all the other code is the same as it ever was). And it has no backward compat implications.
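Just to check my understanding of the surface you'd expose to Python, it would enable something like this (compile_from_tokens is an invented name; the stub below emulates it with today's untokenize round trip, which is exactly the step the real thing would remove):

    import io
    import tokenize

    def compile_from_tokens(tokens, filename, mode):
        # Invented stand-in for the proposed entry point that would hand
        # a token iterable straight to the parser; emulated here via the
        # round trip back to text.
        return compile(tokenize.untokenize(tokens), filename, mode)

    # toks = tokenize.tokenize(io.BytesIO(b"x = 1.1d\n").readline)
    # code = compile_from_tokens(literal_d(toks), "<hooked>", "exec")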
>> If it's hard to write a fast parser, then consider writing a parser generator in Python that generates the C code you want.
>
> It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so one that matches the parser of whatever version of Python you happen to be running under).

I think it's worth exploring having the whole parser in one place, rather than repeating the same structures in at least four places. With every change to the Python grammar, you pay for this forced repetition.