[Python-ideas] Hooking between lexer and parser

Andrew Barnert abarnert at yahoo.com
Sat Jun 6 07:30:58 CEST 2015


First, I think your idea is almost completely tangential to mine. Yes, if you completely replaced both the interface and the implementation of the parser, you could do just about anything you wanted. But assuming nobody is going to completely replace the way Python does parsing today, I think it's still useful to add the one missing useful hook to the existing system. But let's continue.

On Friday, June 5, 2015 7:08 PM, Neil Girdhar <mistersheik at gmail.com> wrote:
On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>
>On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik at gmail.com> wrote:


>>Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook because an import hook can't get access to the token stream.
>>
>>And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; it would be a whole lot harder without that, while it would be a little easier and a whole lot cleaner if I could just hook the token stream.
>

>Yes, I think I understand your motivation.  Can you help me understand what the hook you would write would look like?

That's a fair question.

First, let's look at the relevant parts of an AST hook that transforms every float literal into a Decimal constructor call:

    import ast
    import importlib.util

    class FloatNodeWrapper(ast.NodeTransformer):
        def visit_Num(self, node):
            if isinstance(node.n, float):
                return ast.Call(func=ast.Name(id='Decimal', ctx=ast.Load()),
                                args=[ast.Str(s=repr(node.n))], keywords=[])
            return node


    # ...

    def source_to_code(self, data, path, *, _optimize=-1):
        source = importlib.util.decode_source(data)
        tree = ast.parse(source)
        tree = FloatNodeWrapper().visit(tree)
        ast.fix_missing_locations(tree)
        return compile(tree, path, 'exec', dont_inherit=True, optimize=_optimize)

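For context, here's roughly how a source_to_code override like that gets wired into the import system. This is just a sketch: the DecimalFloatLoader name is made up for the example, and the imported modules still need Decimal bound at run time (e.g. from decimal import Decimal):

    import importlib.machinery
    import sys

    class DecimalFloatLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            ...  # the AST-transforming version shown above

    # Register a path hook that uses the custom loader for .py files,
    # keeping the standard loaders for extension and bytecode files.
    loader_details = [
        (importlib.machinery.ExtensionFileLoader, importlib.machinery.EXTENSION_SUFFIXES),
        (DecimalFloatLoader, importlib.machinery.SOURCE_SUFFIXES),
        (importlib.machinery.SourcelessFileLoader, importlib.machinery.BYTECODE_SUFFIXES),
    ]
    sys.path_hooks.insert(0, importlib.machinery.FileFinder.path_hook(*loader_details))
    sys.path_importer_cache.clear()  # so already-seen directories pick up the new hook

Because the hook goes at the front of sys.path_hooks it applies to every directory on sys.path, which is fine for an experiment but probably not something you'd ship.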

Now, here's what I'd like to write for a token hook that does the same thing at the token level:

    import io
    import importlib.util
    import tokenize

    def retokenize_float(tokens):
        for num, val, *loc in tokens:
            # Rough heuristic: a NUMBER token with a decimal point or an
            # exponent is treated as a float literal.
            if num == tokenize.NUMBER and ('.' in val or 'e' in val or 'E' in val):
                yield tokenize.NAME, 'Decimal', *loc
                yield tokenize.OP, '(', *loc
                yield tokenize.STRING, repr(val), *loc
                yield tokenize.OP, ')', *loc
            else:
                yield num, val, *loc

    # ...

    def source_to_code(self, data, path, *, _optimize=-1):
        source = importlib.util.decode_source(data)
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        tokens = retokenize_float(tokens)
        return compile(tokens, path, 'exec', dont_inherit=True, optimize=_optimize)

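Since compile() won't actually accept a token stream today, the closest you can get is to round-trip through tokenize.untokenize and compile the resulting text; that's the sort of workaround I was alluding to above. A rough sketch, reusing retokenize_float from the example:

    import io
    import tokenize

    def source_to_code(self, data, path, *, _optimize=-1):
        tokens = tokenize.tokenize(io.BytesIO(data).readline)
        # Drop the position info so untokenize falls back to its two-tuple
        # "compat" mode and doesn't trip over the made-up locations of the
        # inserted tokens.
        stripped = ((tok[0], tok[1]) for tok in retokenize_float(tokens))
        source = tokenize.untokenize(stripped)
        return compile(source, path, 'exec', dont_inherit=True, optimize=_optimize)

The round-tripped source text is ugly, but it compiles, and it's the kind of trick an import hook can actually pull off today.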

Of course I don't actually want to do the same thing; I want to do something that you can't do at the AST level (see my user literal hack for an example). But this shows the parallels and differences between the two. If you want more background, see 
http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking-python.html (which I wrote to explain to someone else how floatliteralhack works).

Of course I'm not presenting this as the design I'd pick if I were starting Python from scratch, but as the best design given what Python is already producing and consuming (a stream of tokens that's guaranteed to be equivalent to what you get out of the tokenize module).

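If it's not obvious what that stream looks like, it's easy to inspect directly; a quick sketch using the tokenize module:

    import io
    import tokenize

    for tok in tokenize.generate_tokens(io.StringIO("x = 1.5\n").readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
    # Roughly: NAME 'x', OP '=', NUMBER '1.5', NEWLINE '\n', ENDMARKER ''

Those are exactly the kind of tuples (TokenInfo named tuples, really) that the retokenize_float example above iterates over.
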
>>Also, Python is defined as having a separate lexical analysis phase, and there's a module named tokenize that does the same thing as this phase, and tests that test it, and so on. So, you can't just throw all that out and remain backward compatible.
>

>I don't see why that is.  The lexical "phase" would just become new parsing rules, and so it would be supplanted by the parser.  Then you wouldn't need to add special hooks for lexing.  You would merely have hooks for parsing.

Callbacks from a tree builder and using a user-modifiable grammar are clearly not backward compatible with ast.NodeTransformer. They're a completely different way of doing things.

Is it a better way? Maybe. Plenty of people are using OMeta/JS every day. Of course plenty of people are cursing the fact that OMeta/JS sometimes generates exponential-time backtracking and it's never clear which part of your production rules is to blame, or the fact that you can't get a useful error message out of it, etc.

And I'm pretty sure you could design something with most of the strengths of OMeta without its weaknesses (just using a standard packrat PEG parser instead of an extended PEG parser seems like it would turn most of the exponential productions into explicit errors in the grammar…). Or you could go the opposite way and use GLR and bottom-up callbacks instead of PEG and top-down. Something like that would be a great idea for a research project. But it's probably not a great idea for a proposal to change the core of CPython, at least until someone does that research project.

>>It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so one that does the exact same thing as whatever version of Python you're running under's parser).
>
>I think it's worth exploring having the whole parser in one place, rather than repeating the same structures in at least four places.  With every change to the Python grammar, you pay for this forced repetition.
>

The repetition is really a different issue. A different implementation of the same basic design Python already has could make it so you only have to write explicit code for the 3 places where the CST->AST conversion doesn't follow the same rules as everywhere else and the dozen or so places where the AST has to be post-validated, instead of having to write explicit code for both sides of every single node type. And that kind of cleanup could be done without breaking backward compatibility, because the interfaces on each side of the code would be unchanged. But that's also a lot less fun of a change than writing a whole new parser, so I wouldn't be surprised if nobody ever did it…

