[Python-ideas] Hooking between lexer and parser

Mon Jun 8 04:20:06 CEST 2015

On Sat, Jun 6, 2015 at 6:52 PM, Andrew Barnert <abarnert at yahoo.com> wrote:

> On Jun 6, 2015, at 09:23, Neil Girdhar <mistersheik at gmail.com> wrote:
>
> On Sat, Jun 6, 2015 at 3:17 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
>
>> On Jun 5, 2015, at 22:50, Neil Girdhar <mistersheik at gmail.com> wrote:
>>
>> On Sat, Jun 6, 2015 at 1:30 AM, Andrew Barnert <abarnert at yahoo.com>
>> wrote:
>>
>>> First, I think your idea is almost completely tangential to mine. Yes,
>>> if you completely replaced both the interface and the implementation of the
>>> parser, you could do just about anything you wanted. But assuming nobody is
>>> going to completely replace the way Python does parsing today, I think it's
>>> still useful to add the one missing useful hook to the existing system. But
>>> let's continue.
>>>
>>> On Friday, June 5, 2015 7:08 PM, Neil Girdhar <mistersheik at gmail.com>
>>> wrote:
>>> On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert at yahoo.com>
>>> wrote:
>>> >
>>> If you want more background, see
>>>
>>> http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking-python.html
>>> (which I wrote to explain to someone else how floatliteralhack works).
>>>
>>
>> Yes.  I want to point that if the lexer rules were alongside the parser,
>> they would be generating ast nodes — so the hook for calling Decimal for
>> all floating point tokens would be doable in the same way as your AST hook.
>>
>>
>> No. The way Python currently exposes things, the AST hook runs on an
>> already-generated AST and transforms it into another one, to hand off to
>> the code generator. That means it can only be used to handle things that
>> parse as legal Python syntax (unless you replace the entire parser).
>>
>> What I want is a way to similarly take an already-generated token stream
>> and transform it into another one, to hand off to the parser. That will
>> allow it to be used to handle things that lex as legal Python tokens but
>> don't parse as legal Python syntax, like what Paul suggested. Merging
>> lexing into parsing not only doesn't give me that, it makes that impossible.
>>
>
> Yes, and I what I was suggesting is for the lexer to return AST nodes, so
> it would be fine to process those nodes in the same way.
>
>
> Seriously?
>
> Tokens don't form a tree, they form a list. Yes, every linked list is just
> a degenerate tree, so you could have every "node" just include the next one
> as a child. But why? Do you want to then the input text into a tree of
> character nodes?
>

When the work done by the lexer is done by the parser, the characters
contained in a lexical node will be siblings.  Typically, in Python a tree
is represented as nodes with iterable children, so the characters would
just be a string.

>
> Python has all kinds of really good tools for dealing with iterables; why
> take away those tools and force me to work with a more complicated
> abstraction that Python doesn't have any tools for dealing with?
>

The stream would still be just an iterable.

>
> In the case of the user-defined literal hack, for example, I can use the
> adjacent-pairs recipe from itertools and my transformation becomes trivial.
> I did it more explicitly in the hack I uploaded, using a generator function
> with a for statement, just to make it blindingly obvious what's happening.
> But if I had to deal with a tree, I'd either have to write explicit
> lookahead or store some state explicitly on the tree or the visitor. That
> isn't exactly _hard_, but it's certainly _harder_, and for no benefit.
>
> Also, if we got my change, I could write code that cleanly hooks parsing
> in 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people
> can at least use it, and all of the relevant and complicated code would be
> shared between the two versions. With your change, I'd have to write code
> that was completely different for 3.6+ than what I could backport, meaning
> I'd have to write, debug, and maintain two completely different
> implementations. And again, for no benefit.
>
> And finally, once again: we already have a token stream as part of the
> process, we already expose every other interesting step in the process,
> exposing the token stream as it exists today in a way that fits into
> everything else as it exists today is clearly the easiest and least
> disruptive thing to do. Sometimes it's worth doing harder or more
> disruptive things because they provide more benefit, but you haven't yet
> shown any benefit.
>
>
You would still be able to do all this stuff.

> You asked me for examples, and I provided them. Why don't you try writing
> a couple of actual examples--user literals, the LINQ-ish example from
> MacroPy, whatever--using your proposed design to show us how they could be
> simpler, or more elegant, or open up further possibilities. Or come up with
> an example of something your design could do that the existing one (even
> with my small proposed change) can't.
>
>
If I find time, I'll do that.  I will explain my solution in another
message.

> For the new tokens that you want, the ideal solution I think is to modify
>> the python parsing grammar before it parses the text.
>>
>>
>> But I don't want any new tokens. I just want to change the way existing
>> tokens are interpreted.
>>
>> Just as with an AST hook like PyMacro, I don't want any new nodes, I just
>> want to change the way existing nodes are interpreted.
>>
>>
> Yes, I see *how* you're trying to solve your problem, but my preference is
> to have one kind of hook rather than two kinds by unifying lexing and
> parsing.  I think that's more elegant.
>
>
> I'm trying to find a way to interpret this that makes sense. I think
> you're suggesting that we should throw out the idea of letting users write
> and install simple post-processing hooks in Python, because that will force
> us to find a way to instead make the entire parsing process
> user-customizable at runtime, which will force users to come up with "more
> elegant" solutions involving changing the grammar instead of
> post-processing it macro-style.
>
> If so, I think that's a very bad idea. Decades of practice in both Python
> and many other languages (especially those with built-in macro facilities)
> shows that post-processing at the relevant level is generally simple and
> elegant. Even if we had a fully-runtime-customizable parser, something like
> OMeta but "closing the loop" and implementing the language in the
> programmable metalanguage, many things are still simpler and more elegant
> written post-processing style (as used by existing import hooks, including
> MacroPy, and in other languages going all the way back to Lisp), and
> there's a much lower barrier to learning them, and there's much less risk
> of breaking the compiler/interpreter being used to run your hook in the
> first place. And, even if none of that were true, and your new and improved
> system really were simpler in every case, and you had actually built it
> rather than just envisioning it, there's still backward compatibility to
> think of. Do you really want to break working, documented functionality
> that people have written things like MacroPy on top of, even if forcing
> them to redesign and rewrite everything from scratch would force them to
> come up with a "more elegant" solution? And finally, the added flexibility
> of such a system is a cost as well as a benefit--the fact that Arc makes it
> as easy as possible to "rewrite the language into one that makes writing
> your application trivial" also means that one Arc programmer can't
> understand another's code until putting in a lot of effort to learn his
> idiosyncratic language.
>

I understand that you are motivated by a specific problem.  However, your
solution does not solve the general problem.  If you only allow
transformations of the token stream, the token set is fixed.
Transformations of the token stream also hide — even in your example — the
fact that you're actually building what is conceptually a subtree.

It makes more sense to me to solve the problem in general, once and for
all.  (1) Make it easy to change the grammar, and (2) make lexing part of
the grammar.  Now, you don't have to change the grammar to solve some
problems.  Sometimes, you can just use AST transformers to accomplish what
you were doing with a lexical transformer.  That's nice because it's one
less thing to learn.

Sometimes, you need the power of changing the grammar.  That is already
coming with the popularity of languages like Theano.  I really want to
transform Python code into Theano, and for that it may be more elegant to
change the grammar.

>
> I don't know about OMeta, but the Earley parsing algorithm is worst-cast
>> cubic time "quadratic time for unambiguous grammars, and linear time for
>> almost all LR(k) grammars".
>>
>>
>> I don't know why you'd want to use Earley for parsing a programming
>> language. IIRC, it was the first algorithm that could handle rampant
>> ambiguity in polynomial time, but that isn't relevant to parsing
>> programming languages (especially one like Python, which was explicitly
>> designed to be simple to parse), and it isn't relevant to natural languages
>> if you're not still in the 1960s, except in learning the theory and history
>> of parsing. GLR does much better in almost-unambiguous/almost-deterministic
>> languages; CYK can be easily extended with weights (which propagate
>> sensibly, so you can use them for a final judgment, or to heuristically
>> prune alternatives as you go); Valiant is easier to reason about
>> mathematically; etc. And that's just among the parsers in the same basic
>> family as Earley.
>>
>
> I suggested Earley to mitigate this fear of "exponential backtracking"
> since that won't happen in Earley.
>
>
> I already explained that using standard PEG with a packrat parser instead
> of extended PEG with an OMeta-style parser gives you linear time. Why do
> you think telling me about a decades-older cubic-time algorithm designed
> for parsing natural languages that's a direct ancestor to two other
> algorithms I also already mentioned is going to be helpful? Do you not
> understand the advantages of PEG or GLR over Earley?
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20150607/22d866c7/attachment-0001.html>