[Python-ideas] Hooking between lexer and parser

Andrew Barnert abarnert at yahoo.com
Sun Jun 7 00:52:22 CEST 2015


On Jun 6, 2015, at 09:23, Neil Girdhar <mistersheik at gmail.com> wrote:
> 
>> On Sat, Jun 6, 2015 at 3:17 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
>>> On Jun 5, 2015, at 22:50, Neil Girdhar <mistersheik at gmail.com> wrote:
>>> 
>>>> On Sat, Jun 6, 2015 at 1:30 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
>>>> First, I think your idea is almost completely tangential to mine. Yes, if you completely replaced both the interface and the implementation of the parser, you could do just about anything you wanted. But assuming nobody is going to completely replace the way Python does parsing today, I think it's still useful to add the one missing useful hook to the existing system. But let's continue.
>>>> 
>>>> On Friday, June 5, 2015 7:08 PM, Neil Girdhar <mistersheik at gmail.com> wrote:
>>>> On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>>>> >
>>>> If you want more background, see
>>>> http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking-python.html (which I wrote to explain to someone else how floatliteralhack works).
>>> 
>>> Yes.  I want to point out that if the lexer rules were alongside the parser, they would be generating ast nodes, so the hook for calling Decimal for all floating point tokens would be doable in the same way as your AST hook.
>> 
>> No. The way Python currently exposes things, the AST hook runs on an already-generated AST and transforms it into another one, to hand off to the code generator. That means it can only be used to handle things that parse as legal Python syntax (unless you replace the entire parser).
>> 
>> What I want is a way to similarly take an already-generated token stream and transform it into another one, to hand off to the parser. That will allow it to be used to handle things that lex as legal Python tokens but don't parse as legal Python syntax, like what Paul suggested. Merging lexing into parsing not only doesn't give me that, it makes that impossible.
> 
> Yes, and what I was suggesting is for the lexer to return AST nodes, so it would be fine to process those nodes in the same way.

Seriously?

Tokens don't form a tree; they form a list. Yes, every linked list is just a degenerate tree, so you could have every "node" just include the next one as a child. But why? Do you want to then turn the input text into a tree of character nodes?

Python has all kinds of really good tools for dealing with iterables; why take away those tools and force me to work with a more complicated abstraction that Python doesn't have any tools for dealing with?

In the case of the user-defined literal hack, for example, I can use the adjacent-pairs recipe from itertools and my transformation becomes trivial. I did it more explicitly in the hack I uploaded, using a generator function with a for statement, just to make it blindingly obvious what's happening. But if I had to deal with a tree, I'd either have to write explicit lookahead or store some state explicitly on the tree or the visitor. That isn't exactly _hard_, but it's certainly _harder_, and for no benefit.
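(To make that concrete, here's roughly what such a tokenize-level transformation looks like as a generator over the token stream. It's a minimal sketch of the float-to-Decimal rewrite discussed earlier in the thread rather than the user-literal hack, since it needs no lookahead at all; the function names and the exact wrapping are illustrative, not the code from the actual uploaded hack.)

    import io
    import tokenize

    def decimalize(tokens):
        # Pass every token through untouched, except NUMBER tokens that look
        # like floats, which get rewritten into a Decimal('...') call.
        for tok in tokens:
            if tok.type == tokenize.NUMBER and '.' in tok.string:
                # (A real hook would also catch exponent-only floats like
                # 1e3; the test is kept short here.)
                yield (tokenize.NAME, 'Decimal')
                yield (tokenize.OP, '(')
                yield (tokenize.STRING, repr(tok.string))
                yield (tokenize.OP, ')')
            else:
                yield (tok.type, tok.string)

    def transform(source):
        # The rewritten source assumes Decimal is in scope wherever it
        # eventually runs (e.g. injected by whatever installs this hook).
        readline = io.StringIO(source).readline
        return tokenize.untokenize(decimalize(tokenize.generate_tokens(readline)))

    print(transform("x = 0.1 + 0.2\n"))
    # -> x = Decimal('0.1') + Decimal('0.2'), modulo untokenize's spacing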

Also, if we got my change, I could write code that cleanly hooks parsing in 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people can at least use it, and all of the relevant and complicated code would be shared between the two versions. With your change, I'd have to write code that was completely different for 3.6+ than what I could backport, meaning I'd have to write, debug, and maintain two completely different implementations. And again, for no benefit.
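(For the record, the Python 3 side of that tokenize/untokenize wiring is just the standard import-hook recipe from the importlib docs; a rough sketch, assuming the transform() helper from the sketch above, and glossing over the encoding and caching details that a real hook like MacroPy handles properly.)

    import importlib
    import sys
    from importlib.machinery import FileFinder, SourceFileLoader

    class TransformingLoader(SourceFileLoader):
        def source_to_code(self, data, path, *args, **kwargs):
            # Run the token-level rewrite, then compile as usual. Decoding as
            # UTF-8 is a shortcut; real code would respect coding declarations
            # (tokenize.detect_encoding).
            return compile(transform(data.decode('utf-8')), path, 'exec',
                           dont_inherit=True)

    # Route ordinary .py files on sys.path through the new loader.
    sys.path_hooks.insert(0, FileFinder.path_hook((TransformingLoader, ['.py'])))
    sys.path_importer_cache.clear()
    importlib.invalidate_caches()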

And finally, once again: we already have a token stream as part of the process, and we already expose every other interesting step in that process; exposing the token stream as it exists today, in a way that fits into everything else as it exists today, is clearly the easiest and least disruptive thing to do. Sometimes it's worth doing harder or more disruptive things because they provide more benefit, but you haven't yet shown any benefit.

You asked me for examples, and I provided them. Why don't you try writing a couple of actual examples--user literals, the LINQ-ish example from MacroPy, whatever--using your proposed design to show us how they could be simpler, or more elegant, or open up further possibilities. Or come up with an example of something your design could do that the existing one (even with my small proposed change) can't.

>>> For the new tokens that you want, the ideal solution, I think, is to modify the Python parsing grammar before it parses the text.
>> 
>> But I don't want any new tokens. I just want to change the way existing tokens are interpreted.
>> 
>> Just as with an AST hook like MacroPy, I don't want any new nodes, I just want to change the way existing nodes are interpreted.
> 
> Yes, I see *how* you're trying to solve your problem, but my preference is to have one kind of hook rather than two kinds by unifying lexing and parsing.  I think that's more elegant.

I'm trying to find a way to interpret this that makes sense. I think you're suggesting that we should throw out the idea of letting users write and install simple post-processing hooks in Python, because that will force us to find a way to instead make the entire parsing process user-customizable at runtime, which will force users to come up with "more elegant" solutions involving changing the grammar instead of post-processing it macro-style. 

If so, I think that's a very bad idea. Decades of practice in both Python and many other languages (especially those with built-in macro facilities) shows that post-processing at the relevant level is generally simple and elegant. Even if we had a fully-runtime-customizable parser, something like OMeta but "closing the loop" and implementing the language in the programmable metalanguage, many things are still simpler and more elegant written post-processing style (as used by existing import hooks, including MacroPy, and in other languages going all the way back to Lisp), and there's a much lower barrier to learning them, and there's much less risk of breaking the compiler/interpreter being used to run your hook in the first place.

And, even if none of that were true, and your new and improved system really were simpler in every case, and you had actually built it rather than just envisioning it, there's still backward compatibility to think of. Do you really want to break working, documented functionality that people have written things like MacroPy on top of, even if forcing them to redesign and rewrite everything from scratch would force them to come up with a "more elegant" solution?

And finally, the added flexibility of such a system is a cost as well as a benefit--the fact that Arc makes it as easy as possible to "rewrite the language into one that makes writing your application trivial" also means that one Arc programmer can't understand another's code until putting in a lot of effort to learn his idiosyncratic language.

>>> I don't know about OMeta, but the Earley parsing algorithm is worst-case cubic time, "quadratic time for unambiguous grammars, and linear time for almost all LR(k) grammars".
>> 
>> I don't know why you'd want to use Earley for parsing a programming language. IIRC, it was the first algorithm that could handle rampant ambiguity in polynomial time, but that isn't relevant to parsing programming languages (especially one like Python, which was explicitly designed to be simple to parse), and it isn't relevant to natural languages if you're not still in the 1960s, except in learning the theory and history of parsing. GLR does much better in almost-unambiguous/almost-deterministic languages; CYK can be easily extended with weights (which propagate sensibly, so you can use them for a final judgment, or to heuristically prune alternatives as you go); Valiant is easier to reason about mathematically; etc. And that's just among the parsers in the same basic family as Earley.
> 
> I suggested Earley to mitigate this fear of "exponential backtracking" since that won't happen in Earley.

I already explained that using standard PEG with a packrat parser instead of extended PEG with an OMeta-style parser gives you linear time. Why do you think telling me about a decades-older cubic-time algorithm designed for parsing natural languages that's a direct ancestor to two other algorithms I also already mentioned is going to be helpful? Do you not understand the advantages of PEG or GLR over Earley?
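(To be concrete about the packrat point: "packrat" just means memoizing every rule on input position, so each (rule, position) pair is computed at most once and the parse is linear in the input. A minimal sketch on a toy PEG, which is purely illustrative and has nothing to do with CPython's actual grammar:)

    from functools import lru_cache

    # Each function returns the position after a successful match, or None.
    # PEG-style greedy optionals; lru_cache on (src, position) is what
    # makes it "packrat".

    @lru_cache(maxsize=None)
    def factor(src, i):
        # factor <- [0-9]
        return i + 1 if i < len(src) and src[i].isdigit() else None

    @lru_cache(maxsize=None)
    def term(src, i):
        # term <- factor ('*' term)?
        j = factor(src, i)
        if j is not None and j < len(src) and src[j] == '*':
            k = term(src, j + 1)
            if k is not None:
                return k
        return j

    @lru_cache(maxsize=None)
    def expr(src, i):
        # expr <- term ('+' expr)?
        j = term(src, i)
        if j is not None and j < len(src) and src[j] == '+':
            k = expr(src, j + 1)
            if k is not None:
                return k
        return j

    assert expr("1+2*3", 0) == 5   # the whole input matches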