[Python-ideas] Hooking between lexer and parser

Mon Jun 8 04:23:59 CEST 2015

On Sun, Jun 7, 2015 at 1:59 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 7 June 2015 at 08:52, Andrew Barnert via Python-ideas
> <python-ideas at python.org> wrote:
> > Also, if we got my change, I could write code that cleanly hooks parsing in
> > 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people can
> > at least use it, and all of the relevant and complicated code would be
> > shared between the two versions. With your change, I'd have to write code
> > that was completely different for 3.6+ than what I could backport, meaning
> > I'd have to write, debug, and maintain two completely different
> > implementations. And again, for no benefit.
>
> I don't think I've said this explicitly yet, but I'm +1 on the idea of
> making it easier to "hack the token stream". As Andrew has noted, there
> are two reasons this is an interesting level to work at for certain
> kinds of modifications:
>
> 1. The standard Python tokeniser has already taken care of converting
> the byte stream into Unicode code points, and the code point stream
> into tokens (including replacing leading whitespace with the
> structural INDENT/DEDENT tokens)
>

I will explain in another message how to replace the indent and dedent
tokens so that the lexer loses most of its "magic" and becomes just like
the parser.
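
As a reminder of what the structural tokens look like today, here is a
minimal sketch using the standard tokenize module. The sample source and
the helper name token_names are only illustrations, not part of any
proposal.

```python
import io
import tokenize

# Hypothetical sample input: one indented block followed by a top-level line.
source = "if x:\n    y = 1\nz = 2\n"


def token_names(text):
    """Return the token type names the standard tokenizer yields for text."""
    readline = io.StringIO(text).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


names = token_names(source)
print(names)
```

The leading whitespace before "y" never shows up as a token of its own; it
is reported as one INDENT before the block and one DEDENT after it.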

>
> 2. You get to work with a linear stream of tokens, rather than a
> precomposed tree of AST nodes that you have to traverse and keep
> consistent
>

The AST nodes would contain within them the linear stream of tokens that
you are free to work with.  The AST also encodes the structure of the
tokens, which can be very useful if only to debug how the tokens are being
parsed.  You might find yourself, when doing a more complicated lexical
transformation, trying to reverse engineer where the parse tree nodes begin
and end in the token stream.  That would be a nightmare, and it is the main
problem with trying to process the token stream "blind" to the parse tree.
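
To give a feel for that reverse engineering, here is a rough sketch that
recovers the tokens belonging to one AST node by comparing positions. It
assumes a Python version whose AST nodes carry end_lineno and
end_col_offset (3.8 and later) and ASCII-only source, since ast reports
byte offsets while tokenize reports character offsets. The helper name
tokens_for_node is made up for illustration.

```python
import ast
import io
import tokenize

# Token types that carry layout only and are skipped when matching spans.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.ENDMARKER}


def tokens_for_node(source, node):
    """Return the strings of the tokens lying inside node's source span.

    Assumes ASCII source and nodes with end positions (Python 3.8+).
    """
    start = (node.lineno, node.col_offset)
    end = (node.end_lineno, node.end_col_offset)
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT:
            continue
        if start <= tok.start and tok.end <= end:
            found.append(tok.string)
    return found


source = "x = foo(1, 2) + bar\n"
tree = ast.parse(source)
call = tree.body[0].value.left  # the Call node for foo(1, 2)
inner = tokens_for_node(source, call)
print(inner)
```

Even this small case needs position bookkeeping that the parser already
did once; if each node simply held its own tokens, none of it would be
necessary.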

>
> If all you're wanting to do is token rewriting, or to push the token
> stream over a network connection in preference to pushing raw source
> code or fully compiled bytecode, a bit of refactoring of the existing
> tokeniser/compiler interface to be less file based and more iterable
> based could make that easier to work with.
>

You can still do all that with the tokens included in the parse tree.
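
For reference, the kind of token rewriting under discussion can be
sketched with the existing tokenize/untokenize round trip. This is only
an illustration under my own assumptions: the helper rename_name and the
sample source are hypothetical, and passing (type, string) pairs to
untokenize means the original spacing is not preserved.

```python
import io
import tokenize


def rename_name(source, old, new):
    """Rewrite every NAME token equal to old into new via the token stream."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    rewritten = [
        (tok.type,
         new if tok.type == tokenize.NAME and tok.string == old else tok.string)
        for tok in tokens
    ]
    # With 2-tuples, untokenize regenerates spacing itself.
    return tokenize.untokenize(rewritten)


source = "total = spam + 1\nlabel = 'spam'\n"
result = rename_name(source, "spam", "eggs")

# Run the rewritten source to check that it is still valid Python.
namespace = {"eggs": 41}
exec(result, namespace)
print(namespace["total"], namespace["label"])
```

Only the identifier is renamed; the string literal 'spam' is a STRING
token and is left alone, which is the sort of distinction a plain text
substitution would miss.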

>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>