[Python-ideas] Hooking between lexer and parser
abarnert at yahoo.com
Sat Jun 6 09:17:47 CEST 2015
> On Jun 5, 2015, at 22:50, Neil Girdhar <mistersheik at gmail.com> wrote:
>> On Sat, Jun 6, 2015 at 1:30 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
>> First, I think your idea is almost completely tangential to mine. Yes, if you completely replaced both the interface and the implementation of the parser, you could do just about anything you wanted. But assuming nobody is going to completely replace the way Python does parsing today, I think it's still worthwhile to add the one missing hook to the existing system. But let's continue.
>> On Friday, June 5, 2015 7:08 PM, Neil Girdhar <mistersheik at gmail.com> wrote:
>> On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>> If you want more background, see
>> http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking-python.html (which I wrote to explain to someone else how floatliteralhack works).
> Yes. I want to point out that if the lexer rules were alongside the parser, they would be generating AST nodes, so the hook for calling Decimal for all floating-point tokens would be doable in the same way as your AST hook.
No. The way Python currently exposes things, the AST hook runs on an already-generated AST and transforms it into another one, to hand off to the code generator. That means it can only be used to handle things that parse as legal Python syntax (unless you replace the entire parser).
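For concreteness, here's a minimal sketch of that kind of AST hook on a modern CPython (3.9+); this is not the actual floatliteralhack code, just an illustration of the mechanism: an ast.NodeTransformer that rewrites float constants into Decimal calls. Note the limitation: by AST time the literal's source text is gone, so the transform has to reconstruct it with repr.

    import ast

    class FloatToDecimal(ast.NodeTransformer):
        """Rewrite float literals into Decimal(<text>) calls."""
        def visit_Constant(self, node):
            if isinstance(node.value, float):
                # The AST only has the already-rounded float value;
                # repr() recovers the shortest round-tripping string,
                # which is usually, but not always, the original
                # literal (e.g. 1.10 comes back as '1.1').
                return ast.copy_location(
                    ast.Call(
                        func=ast.Name(id='Decimal', ctx=ast.Load()),
                        args=[ast.Constant(value=repr(node.value))],
                        keywords=[],
                    ),
                    node,
                )
            return node

    tree = ast.parse('x = 1.1 + 2.2')
    tree = ast.fix_missing_locations(FloatToDecimal().visit(tree))
    print(ast.unparse(tree))  # x = Decimal('1.1') + Decimal('2.2')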
What I want is a way to similarly take an already-generated token stream and transform it into another one, to hand off to the parser. That will allow it to be used to handle things that lex as legal Python tokens but don't parse as legal Python syntax, like what Paul suggested. Merging lexing into parsing not only doesn't give me that, it makes that impossible.
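You can already do this offline with the stdlib tokenize module; here's a sketch of the idea (a standalone function, not a proposal for the hook API), retokenizing source so float literals become Decimal calls while the original literal text is still available:

    import io
    import tokenize

    def decimalize(source):
        """Rewrap float literals in Decimal(...) calls at the token level.

        Unlike the AST hook, the token stream still carries the
        literal's exact source text, so Decimal('1.10') keeps its
        trailing zero.
        """
        result = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            # Naive float test: catches 1.1 but not 1e5 or complex literals.
            if tok.type == tokenize.NUMBER and '.' in tok.string:
                result.extend([
                    (tokenize.NAME, 'Decimal'),
                    (tokenize.OP, '('),
                    (tokenize.STRING, repr(tok.string)),
                    (tokenize.OP, ')'),
                ])
            else:
                result.append((tok.type, tok.string))
        # untokenize's spacing is ugly, but the result compiles identically.
        return tokenize.untokenize(result)

    print(decimalize('x = 1.1 + 2.2'))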
> For the new tokens that you want, the ideal solution I think is to modify the python parsing grammar before it parses the text.
But I don't want any new tokens. I just want to change the way existing tokens are interpreted.
Just as with an AST hook like MacroPy, I don't want any new nodes; I just want to change the way existing nodes are interpreted.
>> Of course I'm not presenting this as an ideal design if I were starting Python from scratch, but as the best design given what Python is already producing and consuming (a stream of tokens that's guaranteed to be equivalent to what you get out of the tokenize module).
> This is like saying "I want to change some things, but not other things".
That's exactly what I'm saying. In particular, I want to change as few things as possible, to get what I want, without breaking stuff that already demonstrably works and has worked for decades.
> I don't know about OMeta, but the Earley parsing algorithm is worst-case cubic time, "quadratic time for unambiguous grammars, and linear time for almost all LR(k) grammars".
I don't know why you'd want to use Earley for parsing a programming language. IIRC, it was the first algorithm that could handle rampant ambiguity in polynomial time, but that isn't relevant to parsing programming languages (especially one like Python, which was explicitly designed to be simple to parse), and it isn't relevant to natural languages if you're not still in the 1960s, except in learning the theory and history of parsing. GLR does much better in almost-unambiguous/almost-deterministic languages; CYK can be easily extended with weights (which propagate sensibly, so you can use them for a final judgment, or to heuristically prune alternatives as you go); Valiant is easier to reason about mathematically; etc. And that's just among the parsers in the same basic family as Earley.
>> And I'm pretty sure you could design something with most of the strengths of OMeta without its weaknesses (just using a standard packrat PEG parser instead of an extended PEG parser seems like it would turn most of the exponential productions into explicit errors in the grammar…). Or you could go the opposite way and use GLR and bottom-up callbacks instead of PEG and top-down. Something like that would be a great idea for a research project. But it's probably not a great idea for a proposal to change the core of CPython, at least until someone does that research project.
> Yes, totally agree with you. So if it were me doing this work, I would put my energy in the research project to write an amazing parser in Python. And then I would try to convince the Python team to use that. I guess we don't disagree at all.
Well, I think we disagree about the value of our time, and about the cost of disruptive changes.
If I have a relatively low-work, almost completely non-disruptive way to definitely get everything I actually need, and a high-work, hugely disruptive way to probably get what I actually need and also probably get a whole bunch of other useful stuff that I might be able to sell everyone else on if I also did a lot of additional work, that seems like a no-brainer to me.
In fact, even if I wanted to write an amazing parser library for Python (and I kind of do, but I don't know if I have the time), I still don't think I'd want to suggest it as a replacement for the parser in CPython. Writing all the backward-compat adapters, porting the Python parser over with all its quirks intact, and building the tests to prove that its performance and error handling were strictly better wouldn't be nearly as much fun as other things I could do with it.