[Python-ideas] Hooking between lexer and parser

Sat Jun 6 08:04:51 CEST 2015

On Sat, Jun 6, 2015 at 12:27 AM, Guido van Rossum <guido at python.org> wrote:

> On Fri, Jun 5, 2015 at 7:57 PM, Neil Girdhar <mistersheik at gmail.com>
> wrote:
>
>> I don't see why it makes anything simpler.  Your lexing rules just live
>> alongside your parsing rules.  And I also don't see why it has to be faster
>> to do the lexing in a separate part of the code.  Wouldn't the parser
>> generator realize that some of the rules don't use the stack and so
>> they would end up just as fast as any lexer?
>>
>
> You're putting a lot of faith in "modern" parsers. I don't know if PLY
> qualifies as such, but it certainly is newer than Lex/Yacc, and it unifies
> the lexer and parser. However I don't think it would be much better for a
> language the size of Python.
>

I agree with you.  I think the problem might be that the parser I'm
dreaming of doesn't exist for Python.  In another message, I described
what I wanted:

—

Along with the grammar, you also give it code that it can execute as it
matches each symbol in a rule.  In Python, for example, as it matches each
argument passed to a function, it would keep track of the count of *args,
**kwargs, keyword arguments, and regular arguments, and then raise a
syntax error if it encounters anything out of order.  Right now that check
is done in validate.c, which is really annoying.
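
A rough sketch of such a per-argument action follows.  The class and
method names are made up for illustration, and the ordering rule is a
deliberate simplification (nothing may follow **kwargs, and a positional
argument may not follow a keyword argument), not CPython's actual check:

```python
class ArgumentOrderChecker:
    """Hypothetical action object: a parser would call on_argument()
    once for each call argument as it is matched."""

    def __init__(self):
        # Running counts of each kind of argument seen so far.
        self.counts = {"positional": 0, "keyword": 0, "star": 0, "starstar": 0}

    def on_argument(self, kind):
        if kind not in self.counts:
            raise ValueError("unknown argument kind: %r" % (kind,))
        # Simplified rule 1 (assumption): nothing may follow **kwargs.
        if self.counts["starstar"]:
            raise SyntaxError("argument follows **kwargs")
        # Simplified rule 2 (assumption): no positional after a keyword.
        if kind == "positional" and self.counts["keyword"]:
            raise SyntaxError("positional argument follows keyword argument")
        self.counts[kind] += 1


def check_call(kinds):
    """Feed a sequence of argument kinds through the checker."""
    checker = ArgumentOrderChecker()
    for kind in kinds:
        checker.on_argument(kind)
    return checker.counts
```

The point is only that the error is raised at the moment the offending
argument is matched, rather than in a separate pass over the finished tree.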

I want to specify the lexical rules in the same way that I specify the
parsing rules.  And I think (after Andrew elucidates what he means by
hooks) I want the parsing hooks to be the same thing as lexing hooks, and I
agree with him that hooking into the lexer is useful.
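
One way to picture lexical and parsing rules sharing a single form, with
the same kind of hook on both, is the toy sketch below.  Rule, Regex, Seq
and the hook argument are hypothetical names invented for this
illustration; this is not Andrew's design and not the API of PLY or any
other existing parser generator:

```python
import re


class Rule:
    """Base class: every rule, lexical or syntactic, takes the same hook."""

    def __init__(self, hook=None):
        self.hook = hook

    def match(self, text, pos):
        result = self._match(text, pos)
        if result is None:
            return None
        value, end = result
        if self.hook is not None:
            value = self.hook(value)  # same hook mechanism for all rules
        return value, end


class Regex(Rule):
    """A 'lexical' rule: matches a regular expression at a position."""

    def __init__(self, pattern, hook=None):
        super().__init__(hook)
        self.pattern = re.compile(pattern)

    def _match(self, text, pos):
        m = self.pattern.match(text, pos)
        return None if m is None else (m.group(), m.end())


class Seq(Rule):
    """A 'parsing' rule: matches its parts one after another."""

    def __init__(self, *parts, hook=None):
        super().__init__(hook)
        self.parts = parts

    def _match(self, text, pos):
        values = []
        for part in self.parts:
            result = part.match(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos


number = Regex(r"\d+", hook=int)               # hook on a token rule
plus = Regex(r"\+")
addition = Seq(number, plus, number,           # hook on a grammar rule
               hook=lambda v: v[0] + v[2])
result = addition.match("12+30", 0)
```

Here the hook on number and the hook on addition are attached and invoked
in exactly the same way, which is the property being asked for.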

I want the parser module to be automatically generated from the grammar if
that's possible (I think it is).

Typically each grammar rule is implemented using a class.  I want the code
generation to be a method on that class.  This would make changing the AST
easy.  For example, it was suggested that we might change the grammar to
include a starstar_expr node.  This should be an easy change, but because
every node currently validates its children and expects them to have a
certain tree structure, it would be a big task with almost no payoff.

—

I don't think this is possible with PLY.

> We are using PLY at Dropbox to parse a medium-sized DSL, and while at the
> beginning it was convenient to have the entire language definition in one
> place, there were a fair number of subtle bugs in the earlier stages of the
> project due to the mixing of lexing and parsing. In order to get this right
> it seems you actually have to *think* about the lexing and parsing stages
> differently, and combining them in one tool doesn't actually help you to
> think more clearly.
>

That's interesting.  I can understand wanting to separate them mentally,
but there are two problems with separating them at a fundamental
programmatic level: (1) you may want to change a lexical token like number
to be, in some cases, LL(1) for who knows what reason; and (2) you would
have to implement lexical hooks differently from parsing hooks.  In some of
Andrew's code below, the tokenize hook looks very different from the parser
hook, and I think that's unfortunate.

>
> Also, this approach doesn't really do much for the later stages -- you can
> easily construct a parse tree but it's a fairly direct representation of
> the grammar rules, and it offers no help in managing a symbol table or
> generating code.
>

It would be nice to generate the code in methods on the classes that
implement the grammar rules.  This would allow you to use memos that were
filled in as you were parsing and validating to generate code.
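
A minimal sketch of that idea follows, assuming one class per grammar rule
and a toy stack-machine target.  The class names, the memo dictionary, and
the instruction names (LOAD_SLOT, PUSH, ADD) are all invented for
illustration and do not correspond to CPython's compiler:

```python
class Name:
    """Hypothetical rule class for an identifier."""

    def __init__(self, ident, memo):
        self.ident = ident
        # Memo filled in while parsing: each new name gets the next slot.
        self.slot = memo.setdefault(ident, len(memo))

    def generate(self):
        # Code generation reuses the slot recorded at parse time.
        return [("LOAD_SLOT", self.slot)]


class Number:
    """Hypothetical rule class for a numeric literal."""

    def __init__(self, value):
        self.value = value

    def generate(self):
        return [("PUSH", self.value)]


class Add:
    """Hypothetical rule class for 'left + right'."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def generate(self):
        # Emit code for both operands, then the operation itself.
        return self.left.generate() + self.right.generate() + [("ADD",)]


memo = {}
tree = Add(Name("x", memo), Add(Number(2), Name("x", memo)))
code = tree.generate()
```

Because each class owns both its structure and its generate() method,
adding a new node type would mean adding one class rather than touching a
separate validation pass.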

>
>
> --
> --Guido van Rossum (python.org/~guido)
>