<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Jun 6, 2015 at 1:30 AM, Andrew Barnert <span dir="ltr"><<a href="mailto:abarnert@yahoo.com" target="_blank">abarnert@yahoo.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">First, I think your idea is almost completely tangential to mine. Yes, if you completely replaced both the interface and the implementation of the parser, you could do just about anything you wanted. But assuming nobody is going to completely replace the way Python does parsing today, I think it's still useful to add the one missing useful hook to the existing system. But let's continue.<br>

<span class=""><br>

On Friday, June 5, 2015 7:08 PM, Neil Girdhar <<a href="mailto:mistersheik@gmail.com">mistersheik@gmail.com</a>> wrote:<br>

On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <<a href="mailto:abarnert@yahoo.com">abarnert@yahoo.com</a>> wrote:<br>

><br>

>On Jun 5, 2015, at 13:38, Neil Girdhar <<a href="mailto:mistersheik@gmail.com">mistersheik@gmail.com</a>> wrote:<br>

<br>

<br>

</span><span class="">>>Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook because an import hook can't get access to the token stream.<br>

>><br>

>>And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; it would be a whole lot harder without that, while it would be a little easier and a whole lot cleaner if I could just hook the token stream.<br>

><br>

<br>

>Yes, I think I understand your motivation.  Can you help me understand the what the hook you would write would look like?<br>

<br>

</span>That's a fair question.<br>

<br>

First, let's look at the relevant parts of an AST hook that transforms every float literal into a Decimal constructor call:<br>

<br>

    class FloatNodeWrapper(ast.NodeTransformer):<br>

        def visit_Num(self, node):<br>

            if isinstance(node.n, float):<br>

                return ast.Call(func=ast.Name(id='Decimal', ctx=ast.Load()),<br>

                                args=[ast.Str(s=repr(node.n))], keywords=[])<br>

            return node<br>

<br>

<br>

    # ...<br>

<br>

    def source_to_code(self, data, path, *, _optimize=-1):<br>

        source = importlib.decode_source(data)<br>

        tree = ast.parse(source)<br>

        tree = FloatNodeWrapper().visit(tree)<br>

        ast.fix_missing_locations(tree)<br>

        return compile(tree, path, 'exec', dont_inherit=True, optimize=_optimize)<br>

<br>

<br>

Now, here's what I'd like to write for a token hook that does the same thing at the token level:<br>

<br>

    def retokenize_float(tokens):<br>

        for num, val, *loc in tokens:<br>

<br>

            if num == tokenize.NUMBER and ('.' in val or 'e' in val or 'E' in val):<br>

                yield tokenize.NAME, 'Decimal', *loc<br>

                yield tokenize.OP, '(', *loc<br>

<br>

                yield tokenize.STRING, repr(val), *loc<br>

<br>

                yield tokenize.OP, ')', *loc<br>

<br>

            else:<br>

                yield num, val, *loc<br>

<br>

    # ...<br>

<br>

    def source_to_code(self, data, path, *, _optimize=-1):<br>

        source = importlib.decode_source(data)<br>

        tokens = tokenize.tokenize(source)<br>

        tokens = retokenize(tokens)<br>

        return compile(tokens, path, 'exec', dont_inherit=True, optimize=_optimize)<br>

<br>

<br>

Of course I don't want to do the same thing, I want to do something that you can't do at the AST level—see my user literal hack for an example. But this shows the parallels and differences between the two. If you want more background, see<br>

<a href="http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking-python.html" target="_blank">http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking-python.html</a> (which I wrote to explain to someone else how floatliteralhack works).<br></blockquote><div><br></div><div>Yes.  I want to point that if the lexer rules were alongside the parser, they would be generating ast nodes — so the hook for calling Decimal for all floating point tokens would be doable in the same way as your AST hook.  For the new tokens that you want, the ideal solution I think is to modify the python parsing grammar before it parses the text.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

Of course I'm not presenting this as an ideal design if I were starting Python from scratch, but as the best design given what Python is already producing and consuming (a stream of tokens that's guaranteed to be equivalent to what you get out of the tokenize module).<br></blockquote><div><br></div><div>This is like saying "I want to change some things, but not other things".  I want the best long-term solution, whatever that is.  (I don't know what it is.)  In the long run, moving towards the best solution tends to be the least total work.  Specifically, if lexing hooks are implemented differently than parsing hooks, then that change is probably going to be backed out and replaced with the "ideal" — eventually — maybe in five years or ten or twenty.  And when it's removed, there's deprecation periods and upgrade pains.  At least let's explore what is the "ideal" solution?</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span class=""><br>

>>Also, Python is defined as hacking a separate lexical analysis phase, and a module named tokenize that does the same thing as this phase, and tests that test it, and so on. So, you can't just throw all that out and remain backward compatible.<br>

><br>

<br>

>I don't see why that is.  The lexical "phase" would just become new parsing rules, and so it would be supplanted by the parser.  Then you wouldn't need to add special hooks for lexing.  You would merely have hooks for parsing.<br>

<br>

</span>Callbacks from a tree builder and using a user-modifiable grammar are clearly not backward compatible with ast.NodeTransformer. They're a completely different way of doing things.<br>

<br>

Is it a better way? Maybe. Plenty of people are using OMeta/JS every day. Of course plenty of people are cursing the fact that OMeta/JS sometimes generates exponential-time backtracking and it's never clear which part of your production rules are to blame, or the fact that you can't get a useful error message out of it, etc.<br></blockquote><div><br></div><div>I don't know about OMeta, but the Earley parsing algorithm is worst-cast cubic time "quadratic time for unambiguous grammars, and linear time for almost all LR(k) grammars".</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

And I'm pretty sure you could design something with most of the strengths of OMeta without its weaknesses (just using a standard packrat PEG parser instead of an extended PEG parser seems like it would turn most of the exponential productions into explicit errors in the grammar…). Or you could go the opposite way and use GLR and bottom-up callbacks instead of PEG and top-down. Something like that would be a great idea for a research project. But it's probably not a great idea for a proposal to change the core of CPython, at least until someone does that research project.<br></blockquote><div> </div><div>Yes, totally agree with you.  So if it were me doing this work, I would put my energy in the research project to write an amazing parser in Python.   And then  I would try to convince the Python team to use that.  I guess we don't disagree at all.<br><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span class=""><br>

>>It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so one that does the exact same thing as whatever version of Python you're running under's parser).<br>

><br>

>I think it's worth exploring having the whole parser in one place, rather than repeating the same structures in at least four places.  With every change to the Python grammar, you pay for this forced repetition.<br>

><br>

<br>

</span>The repetition is really a different issue. A different implementation of the same basic design Python already has could make it so you only have to write explicit code for the 3 places the CST->AST node doesn't follow the same rules as everywhere else and the dozen or so places where the AST has to be post-validated, instead of having to write explicit code for both sides of every single node type. And that kind of cleanup could be done without breaking backward compatibility, because the interfaces on each side of the code would be unchanged. But that's also a lot less fun of a change than writing a whole new parser, so I wouldn't be surprised if nobody ever did it…<br>

</blockquote></div></div><div class="gmail_extra"><br></div><div class="gmail_extra">Cool, I didn't know it was even possible.</div></div>