[Python-ideas] Hooking between lexer and parser

Sat Jun 6 04:08:03 CEST 2015

On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert at yahoo.com> wrote:

> On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik at gmail.com> wrote:
> >
> > Actually CPython has another step between the AST and the bytecode,
> which validates the AST to block out trees that violate various rules that
> were not easily incorporated into the LL(1) grammar.
>
> Yes, and it also builds CST nodes before building the AST, and there's a
> step after AST validation and before bytecode generation where the symbol
> table for the scope is built and LEG rules are applies. But none of those
> are things that seem particularly useful to hook. Maybe that's just a
> failure of imagination, but I've never wanted to do it.
>
> Hooking the token stream, on the other hand, has pretty obvious uses. For
> example, in the user-defined literal thread, Paul Moore suggested that for
> Nick Coghlan's "compile-time expression" idea, requiring valid Python
> syntax would be way too restrictive, but requiring valid Python tokens is
> probably OK, and it automatically solves the quoting problem, and it would
> usually be easier than parsing text. I think he's right about all three
> parts of that, but unfortunately you can't implement it that way in an
> import hook because an import hook can't get access to the token stream.
>
> And of course my hack for simulating user-defined literals relies on a
> workaround to fake hooking the token stream; it would be a whole lot harder
> without that, while it would be a little easier and a whole lot cleaner if
> I could just hook the token stream.
>

Yes, I think I understand your motivation.  Can you help me understand the
what the hook you would write would look like?

>
> > This means that when you want to change parsing, you have to change: the
> grammar, the AST library, the validation library, and Python's exposed
> parsing module.
>
> And the code generation and other post-validation steps (unless you're
> just trying to create a do-nothing construct--which can be useful to give
> you something new to, e.g., feed to MacroPy, but it's not something you're
> going to check in to core, or use in your own production code).
>
> So yes, changing the grammar is painful. Which is just one more reason
> that being able to hack on Python without having to hack on Python is very
> useful. And as of 3.4, all of the pieces are there to do that, and
> dead-easy to use, and robust enough for production code--as long as the
> level you want to hack on is source text, AST, or bytecode, not token
> stream.
>

Yes.

>
> > Modern parsers do not separate the grammar from tokenizing, parsing, and
> validation.  All of these are done in one place, which not only simplifies
> changes to the grammar, but also protects you from possible
> inconsistencies.  It was really hard for me when I was making changes to
> the parser to keep my conception of these four things synchronized.
> >
> > So in my opinion, if you're going to modernize the parsing, then put it
> all together into one simple library that deals with all of it.  It seems
> like what you're suggesting would add complexity, whereas a merged solution
> would simplify the code.
>
> Rewriting the entire parsing mechanism from scratch might simplify things,
> but it also means rewriting the entire parsing mechanism from scratch. I'm
> sure you could implement a GLR parser generator that takes a complete
> declarative grammar and generates something that goes right from source
> code to a SAX- or iterparse-style pre-validated AST, and that would be a
> really cool thing. But besides being a lot of work, it would also be a huge
> amount of risk. You'd almost certainly end up with new bugs, new places
> where syntax errors are harder to diagnose, new places where compiling is
> slower than it used to be, etc.

> Also, Python is defined as hacking a separate lexical analysis phase, and
> a module named tokenize that does the same thing as this phase, and tests
> that test it, and so on. So, you can't just throw all that out and remain
> backward compatible.
>

I don't see why that is.  The lexical "phase" would just become new parsing
rules, and so it would be supplanted by the parser.  Then you wouldn't need
to add special hooks for lexing.  You would merely have hooks for parsing.

>
> Meanwhile, adding functions to create a token state struct out of a Python
> iterable, drive it, and expose that functionality to Python is a lot less
> work, very unlikely to have any effect on the existing default mechanism
> (if you don't hook the token stream, the existing code runs the same as
> today, except for an if check inside the next-token function), and much
> easier to make exactly compatible with existing behavior even when you do
> hook the token stream (if creating and driving the token state works, all
> the other code is the same as it ever was). And it has no backward compat
> implications.
>
> > If it's hard to write a fast parser, then consider writing a parser
> generator in Python that generates the C code you want.
>
> It's not that it's hard to write a _fast_ parser in Python, but that it's
> hard to write a parser that does the exact same thing as Python's own
> parser (and even more so one that does the exact same thing as whatever
> version of Python you're running under's parser).

I think it's worth exploring having the whole parser in one place, rather
than repeating the same structures in at least four places.  With every
change to the Python grammar, you pay for this forced repetition.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20150605/e501d378/attachment-0001.html>