[Python-ideas] Hooking between lexer and parser

Sun Jun 7 15:24:27 CEST 2015

On Jun 7, 2015, at 03:30, random832 at fastmail.us wrote:
> 
>> On Sun, Jun 7, 2015, at 01:59, Nick Coghlan wrote:
>> 1. The standard Python tokeniser has already taken care of converting
>> the byte stream into Unicode code points, and the code point stream
>> into tokens (including replacing leading whitespace with the
>> structural INDENT/DEDENT tokens)
> 
> Remember that balanced brackets are important for this INDENT/DEDENT
> transformation. What should the parser do with indentation in the
> presence of a hook that consumes a sequence containing unbalanced or
> mixed brackets?

I'm pretty sure that doing nothing special here just means you get a SyntaxError from the parser, although I probably need more test cases to be sure.
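
To make the bracket point concrete, here's a minimal sketch using the stdlib `tokenize` module (the pure-Python tokenizer, not the one the compiler uses internally; `indent_token_names` is just a helper name I made up for illustration). It shows that INDENT/DEDENT tokens are only generated outside brackets, which is why a hook that eats or adds a bracket changes what the parser sees as block structure:

```python
import io
import tokenize

def indent_token_names(source):
    """Return the names of the INDENT/DEDENT tokens emitted for `source`."""
    readline = io.StringIO(source).readline
    return [
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(readline)
        if tok.type in (tokenize.INDENT, tokenize.DEDENT)
    ]

# Leading whitespace outside brackets becomes structural tokens.
block = "if x:\n    y = 1\n"
# The same leading whitespace inside parentheses is ignored.
bracketed = "x = (\n    1,\n    2)\n"

print(indent_token_names(block))      # ['INDENT', 'DEDENT']
print(indent_token_names(bracketed))  # []
```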

Anyway, this is one of those cases I mentioned where the SyntaxError can't actually show you what's wrong with the code, because the error isn't in the actual source, only in the transformed token stream. But there are easier ways to get into that situation: just replace a `None` with a `with` in the token stream, and you get an error that shows you a perfectly valid line, with no indication that a hook has screwed things up for you.
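
A rough way to reproduce that today, assuming we fake the hook by round-tripping through the stdlib `tokenize` module (`swap_none_for_with` and `first_syntax_error` are hypothetical names, not part of any proposal). Note that this sketch compiles the re-rendered text, so unlike a real hook sitting between lexer and parser, the resulting error would point at the transformed line rather than the original one:

```python
import io
import tokenize

def swap_none_for_with(source):
    """Hypothetical hook: turn every NAME token `None` into `with`."""
    readline = io.StringIO(source).readline
    tokens = [
        # Same length as the original token, so positions stay consistent.
        tok._replace(string="with")
        if tok.type == tokenize.NAME and tok.string == "None"
        else tok
        for tok in tokenize.generate_tokens(readline)
    ]
    return tokenize.untokenize(tokens)

def first_syntax_error(source):
    """Compile `source` and return the SyntaxError it raises, or None."""
    try:
        compile(source, "<input>", "exec")
    except SyntaxError as exc:
        return exc
    return None

original = "if spam is None:\n    pass\n"
transformed = swap_none_for_with(original)

print(first_syntax_error(original))     # None: the real source is valid
print(first_syntax_error(transformed))  # a SyntaxError caused only by the hook
```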

I think we can at least detect that the tokens don't match the source line and throw in a note to go look for an installed token-transforming hook. It would be even nicer if we could show what the line looks like after untokenizing the transformed tokens, so the user can see why it's an error. Something like this:

      File "<input>", line 1
        if spam is None:
                   ^
    SyntaxError: invalid syntax
    Tokens do not match input, parsed as
        if spam is with :

Of course in the specific case you mentioned of unbalanced parens swallowing a dedent, the output still wouldn't be useful, but I'm not sure what we could show usefully in that case anyway.
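
For the simple case, the mismatch check and the "parsed as" line could be sketched roughly like this. Everything here is an assumption for illustration: the helper names are made up, the tokens are stdlib `tokenize.TokenInfo` tuples, and the hook is assumed to leave each token's start/end positions alone, which a real hook might not do:

```python
import io
import tokenize

def mismatched_lines(tokens, source):
    """Line numbers where a token's text differs from the source at its position."""
    lines = source.splitlines(keepends=True)
    bad = set()
    for tok in tokens:
        (srow, scol), (erow, ecol) = tok.start, tok.end
        # Skip multi-line tokens and empty/whitespace-only ones
        # (NEWLINE, INDENT, DEDENT, ENDMARKER).
        if srow != erow or not tok.string.strip():
            continue
        if lines[srow - 1][scol:ecol] != tok.string:
            bad.add(srow)
    return sorted(bad)

def render_tokens(tokens, row):
    """Join the visible tokens that start on `row` with single spaces."""
    return " ".join(
        t.string for t in tokens if t.start[0] == row and t.string.strip()
    )

def hook_note(tokens, source):
    """Build the extra note for an error message, or '' if nothing was changed."""
    rows = mismatched_lines(tokens, source)
    if not rows:
        return ""
    return "Tokens do not match input, parsed as\n    " + render_tokens(tokens, rows[0])

source = "if spam is None:\n    pass\n"
original_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
# Stand-in for an installed hook: rewrite `None` to `with`.
hooked = [
    t._replace(string="with") if t.string == "None" else t
    for t in original_tokens
]

print(hook_note(hooked, source))
# Tokens do not match input, parsed as
#     if spam is with :
```

This only compares token text against the source at the recorded positions, so it says nothing useful about the swallowed-dedent case, where the tokens that remain may all still match the source.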