[Python-ideas] Hooking between lexer and parser

Mon Jun 8 04:42:31 CEST 2015

On 8 June 2015 at 12:23, Neil Girdhar <mistersheik at gmail.com> wrote:
>
>
> On Sun, Jun 7, 2015 at 1:59 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>
>> On 7 June 2015 at 08:52, Andrew Barnert via Python-ideas
>> <python-ideas at python.org> wrote:
>> > Also, if we got my change, I could write code that cleanly hooks parsing
>> > in 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people
>> > can at least use it, and all of the relevant and complicated code would be
>> > shared between the two versions. With your change, I'd have to write code
>> > that was completely different for 3.6+ than what I could backport, meaning
>> > I'd have to write, debug, and maintain two completely different
>> > implementations. And again, for no benefit.
>>
>> I don't think I've said this explicitly yet, but I'm +1 on the idea of
>> making it easier to "hack the token stream". As Andrew has noted, there
>> are two reasons this is an interesting level to work at for certain
>> kinds of modifications:
>>
>> 1. The standard Python tokeniser has already taken care of converting
>> the byte stream into Unicode code points, and the code point stream
>> into tokens (including replacing leading whitespace with the
>> structural INDENT/DEDENT tokens)
>
>
> I will explain in another message how to replace the indent and dedent
> tokens so that the lexer loses most of its "magic" and becomes just like the
> parser.

I don't dispute that this *can* be done, but what would it let me do
that I can't already do today? In addition, how will I be able to
continue to do all the things that I can do today with the separate
tokenisation step?
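
For reference, a minimal sketch of what the separate tokenisation step
already provides today, using only the standard library tokenize module.
The sample source and the helper name token_names are illustrative
assumptions, not anything from CPython itself. It shows leading whitespace
arriving as structural INDENT/DEDENT tokens in a flat stream:

```python
import io
import tokenize

# Illustrative sample input: one indented block followed by a dedent.
SOURCE = "if x:\n    y = 1\nz = 2\n"


def token_names(source):
    """Return the token type names for a source string, in stream order."""
    readline = io.StringIO(source).readline
    return [
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(readline)
    ]


names = token_names(SOURCE)
print(names)
```

The INDENT and DEDENT entries in the printed list are produced by the
tokeniser itself, before any parsing happens.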

*Adding* steps to the compilation toolchain is doable (one of the
first things I was involved with in CPython core development was the
introduction of the AST based parser in Python 2.5), but taking them
*away* is much harder.

You appear to have an idealised version of what a code generation
toolchain "should" be, and would like to hammer CPython's code
generation pipeline specifically into that mould. That's not the way
this works - we don't change the code generator for the sake of it, we
change it to solve specific problems with it.

Introducing the AST layer solved a problem. Introducing an AST
optimisation pass would solve a problem. Making the token stream
easier to manipulate would solve a problem.

Merging the lexer and the parser doesn't solve any problem that we have.

>> 2. You get to work with a linear stream of tokens, rather than a
>> precomposed tree of AST nodes that you have to traverse and keep
>> consistent
>
> The AST nodes would contain within them the linear stream of tokens that you
> are free to work with.  The AST also encodes the structure of the tokens,
> which can be very useful if only to debug how the tokens are being parsed.
> You might find yourself, when doing a more complicated lexical
> transformation, trying to reverse engineer where the parse tree nodes begin
> and end in the token stream.  This would be a nightmare.  This is the main
> problem with trying to process the token stream "blind" to the parse tree.

Anything that cares about the structure to that degree shouldn't be
manipulating the token stream - it should be working on the parse
tree.
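
As a rough sketch of a structure-aware change done at the tree level, the
following uses ast.NodeTransformer to fold additions of numeric constants.
The class name FoldAdd and the helper fold are hypothetical, and the code
assumes Python 3.8 or later, where literals parse to ast.Constant:

```python
import ast


class FoldAdd(ast.NodeTransformer):
    """Replace `<number> + <number>` with a single constant node."""

    def visit_BinOp(self, node):
        # Visit children first so nested additions fold bottom-up.
        self.generic_visit(node)
        if (
            isinstance(node.op, ast.Add)
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, (int, float))
            and isinstance(node.right.value, (int, float))
        ):
            folded = ast.Constant(value=node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node


def fold(source):
    """Parse an expression, fold constant additions, return the new tree."""
    tree = FoldAdd().visit(ast.parse(source, mode="eval"))
    ast.fix_missing_locations(tree)
    return tree


tree = fold("1 + 2 + x")
print(ast.dump(tree))
```

The transformer never needs to know where an expression starts or ends in
the source text, because the tree already encodes that.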

>> If all you're wanting to do is token rewriting, or to push the token
>> stream over a network connection in preference to pushing raw source
>> code or fully compiled bytecode, a bit of refactoring of the existing
>> tokeniser/compiler interface to be less file based and more iterable
>> based could make that easier to work with.
>
> You can still do all that with the tokens included in the parse tree.

Not as easily, because I have to navigate the parse tree even when I
don't care about that structure, rather than being able to just look
at the tokens in isolation.
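
A minimal sketch of that kind of token-only rewrite, using the existing
tokenize/untokenize round trip. The function name rename and the sample
source are illustrative assumptions. It passes (type, string) pairs to
untokenize, so the output spacing is not preserved exactly, but the result
is still valid source:

```python
import io
import tokenize


def rename(source, old, new):
    """Rename every NAME token equal to `old`, using only the token stream."""
    readline = io.StringIO(source).readline
    out = []
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME and tok.string == old:
            out.append((tok.type, new))
        else:
            # Everything else, including string literals, passes through.
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)


rewritten = rename("spam = 1\nresult = spam + 41\n", "spam", "eggs")

# Run the rewritten source to confirm it still behaves as expected.
namespace = {}
exec(rewritten, namespace)
print(namespace["result"])
```

The loop only ever inspects one token at a time, with no tree to navigate.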

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia