[Python-ideas] Hooking between lexer and parser

Neil Girdhar mistersheik at gmail.com
Sat Jun 6 07:50:28 CEST 2015


On Sat, Jun 6, 2015 at 1:30 AM, Andrew Barnert <abarnert at yahoo.com> wrote:

> First, I think your idea is almost completely tangential to mine. Yes, if
> you completely replaced both the interface and the implementation of the
> parser, you could do just about anything you wanted. But assuming nobody is
> going to completely replace the way Python does parsing today, I think it's
> still useful to add the one missing useful hook to the existing system. But
> let's continue.
>
> On Friday, June 5, 2015 7:08 PM, Neil Girdhar <mistersheik at gmail.com>
> wrote:
> On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
> >
> >On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik at gmail.com> wrote:
>
>
> >>Hooking the token stream, on the other hand, has pretty obvious uses.
> For example, in the user-defined literal thread, Paul Moore suggested that
> for Nick Coghlan's "compile-time expression" idea, requiring valid Python
> syntax would be way too restrictive, but requiring valid Python tokens is
> probably OK, and it automatically solves the quoting problem, and it would
> usually be easier than parsing text. I think he's right about all three
> parts of that, but unfortunately you can't implement it that way in an
> import hook because an import hook can't get access to the token stream.
> >>
> >>And of course my hack for simulating user-defined literals relies on a
> workaround to fake hooking the token stream; it would be a whole lot harder
> without that, while it would be a little easier and a whole lot cleaner if
> I could just hook the token stream.
> >
>
> >Yes, I think I understand your motivation.  Can you help me understand
> what the hook you would write would look like?
>
> That's a fair question.
>
> First, let's look at the relevant parts of an AST hook that transforms
> every float literal into a Decimal constructor call:
>
>     class FloatNodeWrapper(ast.NodeTransformer):
>         def visit_Num(self, node):
>             if isinstance(node.n, float):
>                 return ast.Call(func=ast.Name(id='Decimal', ctx=ast.Load()),
>                                 args=[ast.Str(s=repr(node.n))],
>                                 keywords=[])
>             return node
>
>
>     # ...
>
>     def source_to_code(self, data, path, *, _optimize=-1):
>         source = importlib.util.decode_source(data)
>         tree = ast.parse(source)
>         tree = FloatNodeWrapper().visit(tree)
>         ast.fix_missing_locations(tree)
>         return compile(tree, path, 'exec', dont_inherit=True,
>                        optimize=_optimize)
>
>
> Now, here's what I'd like to write for a token hook that does the same
> thing at the token level:
>
>     def retokenize_float(tokens):
>         for num, val, *loc in tokens:
>
>             if num == tokenize.NUMBER and ('.' in val or 'e' in val or 'E' in val):
>                 yield tokenize.NAME, 'Decimal', *loc
>                 yield tokenize.OP, '(', *loc
>                 yield tokenize.STRING, repr(val), *loc
>                 yield tokenize.OP, ')', *loc
>             else:
>                 yield num, val, *loc
>
>     # ...
>
>     def source_to_code(self, data, path, *, _optimize=-1):
>         source = importlib.util.decode_source(data)
>         tokens = tokenize.tokenize(source)
>         tokens = retokenize_float(tokens)
>         return compile(tokens, path, 'exec', dont_inherit=True,
>                        optimize=_optimize)
>
>
> Of course I don't want to do the same thing, I want to do something that
> you can't do at the AST level—see my user literal hack for an example. But
> this shows the parallels and differences between the two. If you want more
> background, see
>
> http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking-python.html
> (which I wrote to explain to someone else how floatliteralhack works).
>

Yes.  I want to point out that if the lexer rules were alongside the
parser, they would be generating AST nodes, so the hook that calls Decimal
for every floating-point token would be doable in the same way as your AST
hook.  For the new tokens that you want, I think the ideal solution is to
modify the Python parsing grammar before it parses the text.
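
To make that concrete, here is a very rough sketch.  Almost nothing below
exists today: the pygrammar module and its load_default() and override()
calls are invented purely for illustration, and only the AST-building
function is real code.  The point is just that a NUMBER rule could build
its AST node directly, instead of rewriting tokens or nodes after the
fact:

    import ast

    def number_to_decimal_node(token_text):
        # Build the node for a NUMBER rule directly: anything that looks
        # like a float becomes a Decimal(<string>) call, everything else
        # stays a plain literal.
        if '.' in token_text or 'e' in token_text or 'E' in token_text:
            return ast.Call(func=ast.Name(id='Decimal', ctx=ast.Load()),
                            args=[ast.Str(s=token_text)],
                            keywords=[])
        return ast.Num(n=ast.literal_eval(token_text))

    # Hypothetical usage inside an import hook (pygrammar is made up):
    #     grammar = pygrammar.load_default()
    #     grammar.override('NUMBER', number_to_decimal_node)
    #     tree = grammar.parse(source)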


> Of course I'm not presenting this as an ideal design if I were starting
> Python from scratch, but as the best design given what Python is already
> producing and consuming (a stream of tokens that's guaranteed to be
> equivalent to what you get out of the tokenize module).
>

This is like saying "I want to change some things, but not other things".
I want the best long-term solution, whatever that is.  (I don't know what
it is.)  In the long run, moving towards the best solution tends to be the
least total work.  Specifically, if lexing hooks are implemented
differently from parsing hooks, that design will probably be backed out
and replaced with the "ideal" one eventually, maybe in five years or ten
or twenty.  And when it is removed, there will be deprecation periods and
upgrade pains.  At the very least, let's explore what the "ideal" solution
would be.

>
> >>Also, Python is defined as having a separate lexical analysis phase,
> and a module named tokenize that does the same thing as this phase, and
> tests that test it, and so on. So, you can't just throw all that out and
> remain backward compatible.
> >
>
> >I don't see why that is.  The lexical "phase" would just become new
> parsing rules, and so it would be supplanted by the parser.  Then you
> wouldn't need to add special hooks for lexing.  You would merely have hooks
> for parsing.
>
> Callbacks from a tree builder and using a user-modifiable grammar are
> clearly not backward compatible with ast.NodeTransformer. They're a
> completely different way of doing things.
>
> Is it a better way? Maybe. Plenty of people are using OMeta/JS every day.
> Of course plenty of people are cursing the fact that OMeta/JS sometimes
> generates exponential-time backtracking and it's never clear which part of
> your production rules are to blame, or the fact that you can't get a useful
> error message out of it, etc.
>

I don't know about OMeta, but the Earley parsing algorithm is worst-case
cubic time, "quadratic time for unambiguous grammars, and linear time for
almost all LR(k) grammars".
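
Here is a minimal Earley recognizer, written from scratch just to make
the idea concrete.  It is a toy, not anything CPython or any real library
uses, and it deliberately skips epsilon rules and parse-tree construction
to stay short:

    def earley_recognize(grammar, start, tokens):
        """Return True if `tokens` can be derived from `start`."""
        # An item is (lhs, rhs, dot, origin): a dotted rule plus the
        # chart index where recognition of that rule started.
        chart = [set() for _ in range(len(tokens) + 1)]
        for rhs in grammar[start]:
            chart[0].add((start, rhs, 0, 0))

        for i in range(len(tokens) + 1):
            queue = list(chart[i])
            while queue:
                lhs, rhs, dot, origin = queue.pop()
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:                          # predict
                        for alt in grammar[sym]:
                            new = (sym, alt, 0, i)
                            if new not in chart[i]:
                                chart[i].add(new)
                                queue.append(new)
                    elif i < len(tokens) and tokens[i] == sym:  # scan
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                                           # complete
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new)
                                queue.append(new)

        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[len(tokens)])

    # Ambiguous toy grammar: E -> E '+' E | 'n'
    grammar = {'E': (('E', '+', 'E'), ('n',))}
    print(earley_recognize(grammar, 'E', ['n', '+', 'n', '+', 'n']))  # True

The completer's scan back over earlier chart entries is where the cubic
worst case comes from; for unambiguous or LR-style grammars the chart
stays small, which is where the better bounds come from.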

>
> And I'm pretty sure you could design something with most of the strengths
> of OMeta without its weaknesses (just using a standard packrat PEG parser
> instead of an extended PEG parser seems like it would turn most of the
> exponential productions into explicit errors in the grammar…). Or you could
> go the opposite way and use GLR and bottom-up callbacks instead of PEG and
> top-down. Something like that would be a great idea for a research project.
> But it's probably not a great idea for a proposal to change the core of
> CPython, at least until someone does that research project.
>

Yes, totally agree with you.  So if it were me doing this work, I would
put my energy into the research project to write an amazing parser in
Python, and then I would try to convince the Python team to use that.  I
guess we don't disagree at all.
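
For what it's worth, the "packrat" part of a packrat PEG parser is just
memoized recursive descent.  A toy sketch, written purely for illustration
and not taken from any existing library, for the grammar
expr <- number ('+' expr)? :

    from functools import lru_cache

    def parse_sum(text):
        # Each rule takes a position and returns (value, new_position) or
        # None.  The lru_cache is the "packrat" part: every (rule, pos)
        # pair is computed at most once, so parsing is linear time.
        @lru_cache(maxsize=None)
        def number(pos):
            end = pos
            while end < len(text) and text[end].isdigit():
                end += 1
            return (int(text[pos:end]), end) if end > pos else None

        @lru_cache(maxsize=None)
        def expr(pos):
            # expr <- number ('+' expr)?   PEG: first alternative wins.
            left = number(pos)
            if left is None:
                return None
            value, pos = left
            if pos < len(text) and text[pos] == '+':
                rest = expr(pos + 1)
                if rest is not None:
                    return (value + rest[0], rest[1])
            return (value, pos)

        result = expr(0)
        return result[0] if result and result[1] == len(text) else None

    print(parse_sum('1+2+39'))  # 42

The memo table is also what makes the exponential backtracking cases go
away, at the cost of memory proportional to the input length times the
number of rules.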


> >>It's not that it's hard to write a _fast_ parser in Python, but that
> it's hard to write a parser that does the exact same thing as Python's own
> parser (and even more so one that does the exact same thing as the parser
> of whatever version of Python you're running under).
> >
> >I think it's worth exploring having the whole parser in one place, rather
> than repeating the same structures in at least four places.  With every
> change to the Python grammar, you pay for this forced repetition.
> >
>
> The repetition is really a different issue. A different implementation of
> the same basic design Python already has could make it so you only have to
> write explicit code for the 3 places where the CST->AST conversion doesn't
> follow the same rules as everywhere else and the dozen or so places where
> the AST has
> to be post-validated, instead of having to write explicit code for both
> sides of every single node type. And that kind of cleanup could be done
> without breaking backward compatibility, because the interfaces on each
> side of the code would be unchanged. But that's also a lot less fun of a
> change than writing a whole new parser, so I wouldn't be surprised if
> nobody ever did it…
>

Cool, I didn't know it was even possible.

