[Python-ideas] Hooking between lexer and parser

Andrew Barnert abarnert at yahoo.com
Sat Jun 6 00:58:07 CEST 2015


On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik at gmail.com> wrote:
> 
> Actually CPython has another step between the AST and the bytecode, which validates the AST to block out trees that violate various rules that were not easily incorporated into the LL(1) grammar.  

Yes, and it also builds CST nodes before building the AST, and there's a step after AST validation and before bytecode generation where the symbol table for the scope is built and LEG rules are applied. But none of those are things that seem particularly useful to hook. Maybe that's just a failure of imagination, but I've never wanted to do it.
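
For what it's worth, the endpoints of that pipeline are already observable from pure Python via the stdlib, even though the intermediate CST and validation steps aren't exposed; a minimal sketch:

    import ast
    import dis
    import symtable

    source = "def f(x):\n    return x + 1\n"

    # Tokenize and parse to an AST (the CST step happens under the hood).
    tree = ast.parse(source, filename="<example>")
    print(ast.dump(tree))

    # Build the symbol tables the compiler uses for scoping decisions.
    table = symtable.symtable(source, "<example>", "exec")
    print([child.get_name() for child in table.get_children()])

    # Compile (source or AST) to a code object and look at the bytecode.
    dis.dis(compile(tree, "<example>", "exec"))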

Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook because an import hook can't get access to the token stream.
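
To make the tokens-vs-syntax distinction concrete, here's a quick sketch (the 2.5d spelling is just a made-up literal in the style of that thread): tokenize is perfectly happy with it, while compile rejects it.

    import io
    import tokenize

    # Valid Python *tokens* (NUMBER '2.5' followed by NAME 'd'),
    # but not valid Python *syntax*.
    src = "area = 2.5d * width\n"

    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))

    try:
        compile(src, "<example>", "exec")
    except SyntaxError as e:
        print("rejected by the parser:", e)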

And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; without that workaround it would be a whole lot harder, and with a real token-stream hook it would be a little easier and a whole lot cleaner.
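
The workaround is roughly of this shape (not the actual implementation, just a sketch): a source-level hook tokenizes the text itself, rewrites the token stream, and untokenizes it back into source that the normal compiler can handle.

    import io
    import tokenize

    def retokenize(source):
        """Tokenize, rewrite the token stream, and stitch it back into
        source text.  The rewrite here (upper-casing string literals)
        is only a placeholder for a real transformation."""
        toks = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.STRING:
                tok = tok._replace(string=tok.string.upper())
            toks.append(tok)
        return tokenize.untokenize(toks)

    print(retokenize("x = 'hello'\n"))   # x = 'HELLO'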

> This means that when you want to change parsing, you have to change: the grammar, the AST library, the validation library, and Python's exposed parsing module.

And the code generation and other post-validation steps (unless you're just trying to create a do-nothing construct--which can be useful to give you something new to, e.g., feed to MacroPy, but it's not something you're going to check in to core, or use in your own production code).

So yes, changing the grammar is painful. Which is just one more reason that being able to hack on Python without having to hack on CPython itself is very useful. And as of 3.4, all of the pieces are there to do that, and dead-easy to use, and robust enough for production code--as long as the level you want to hack on is source text, AST, or bytecode, not token stream.
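
For instance, a source-text hook can be as simple as a loader subclass that rewrites the text before compiling it. A rough sketch with the 3.4 importlib machinery (my_transform is a stand-in for whatever rewrite you actually want):

    import importlib.machinery
    import sys

    def my_transform(source):
        # Placeholder for a real text-level rewrite.
        return source

    class TransformingLoader(importlib.machinery.SourceFileLoader):
        """Loader that rewrites module source text before compiling it."""

        def source_to_code(self, data, path, *, _optimize=-1):
            # Real code should decode via importlib.util.decode_source;
            # UTF-8 is assumed here to keep the sketch short.
            source = data.decode("utf-8") if isinstance(data, bytes) else data
            return compile(my_transform(source), path, "exec",
                           dont_inherit=True, optimize=_optimize)

    # Put a FileFinder path hook that uses this loader ahead of the defaults.
    hook = importlib.machinery.FileFinder.path_hook(
        (TransformingLoader, importlib.machinery.SOURCE_SUFFIXES))
    sys.path_hooks.insert(0, hook)
    sys.path_importer_cache.clear()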

> Modern parsers do not separate the grammar from tokenizing, parsing, and validation.  All of these are done in one place, which not only simplifies changes to the grammar, but also protects you from possible inconsistencies.  It was really hard for me when I was making changes to the parser to keep my conception of these four things synchronized.
> 
> So in my opinion, if you're going to modernize the parsing, then put it all together into one simple library that deals with all of it.  It seems like what you're suggesting would add complexity, whereas a merged solution would simplify the code.  

Rewriting the entire parsing mechanism from scratch might simplify things, but it also means rewriting the entire parsing mechanism from scratch. I'm sure you could implement a GLR parser generator that takes a complete declarative grammar and generates something that goes right from source code to a SAX- or iterparse-style pre-validated AST, and that would be a really cool thing. But besides being a lot of work, it would also be a huge amount of risk. You'd almost certainly end up with new bugs, new places where syntax errors are harder to diagnose, new places where compiling is slower than it used to be, etc.

Also, Python is defined as having a separate lexical analysis phase, with a module named tokenize that does the same thing as that phase, and tests that test it, and so on. So, you can't just throw all of that out and remain backward compatible.

Meanwhile, adding functions that create a token state struct out of a Python iterable and drive it, and exposing that functionality to Python, is a lot less work. It's very unlikely to have any effect on the existing default mechanism (if you don't hook the token stream, the existing code runs the same as today, except for an if check inside the next-token function), and it's much easier to keep exactly compatible with existing behavior even when you do hook the token stream (if creating and driving the token state works, all the other code is the same as it ever was). And it has no backward compat implications.
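
From the Python side, the surface of such a hook might look something like this. To be clear, this is entirely hypothetical; nothing like set_token_stream_hook exists, it's just the shape the idea suggests:

    def literal_filter(tokens):
        """User-supplied filter: consumes an iterable of tokens and
        yields (possibly rewritten) tokens back to the compiler."""
        for tok in tokens:
            # ... inspect or rewrite tokens here ...
            yield tok

    # Hypothetical registration point -- today only source text, AST, and
    # bytecode can be hooked through the import system:
    # sys.set_token_stream_hook(literal_filter)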

> If it's hard to write a fast parser, then consider writing a parser generator in Python that generates the C code you want.

It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so, one that does the exact same thing as the parser of whatever version of Python you're running under).

