Hooking between lexer and parser

Compiling a module has four steps:

* bytes -> str (based on the encoding declaration or the default)
* str -> token stream
* token stream -> AST
* AST -> bytecode

You can easily hook at every point in that process except the token stream. There _is_ a workaround: re-encode the text to bytes, wrap it in a BytesIO, call tokenize, munge the token stream, call untokenize, re-decode back to text, then pass that to compile or ast.parse. But, besides being a bit verbose and painful, it means your line and column numbers get screwed up. So, while it's fine for a quick&dirty toy like my user-literal hack, it's not something you'd want to do in a real import hook for use in real code.

This could be solved by changing ast.parse to accept an iterable of tokens or tuples as well as a string, and likewise for compile. That isn't exactly a trivial change: under the covers the _ast module is written in C, partly auto-generated, and expects as input a CST, which is itself created from a different tokenizer written in C with a similar but different API (since C doesn't have iterators). And adding a PyTokenizer_FromIterable or something seems like it might raise some fun bootstrapping issues that I haven't thought through yet. But I think it ought to be doable without reimplementing the whole parser in pure Python, and I think it would be worth doing.

While we're at it, a few other (much smaller) changes would be nice:

* Allow tokenize to take a text file instead of making it take a binary file and repeat the encoding detection.
* Allow tokenize to take a file instead of its readline method.
* Allow tokenize to take a str/bytes instead of requiring a file.
* Add flags to compile to stop at any stage (decoded text, tokens, AST, or bytecode) instead of just the last two.

(The funny thing is that the C tokenizer actually already does support strings and bytes and file objects.)

I realize that doing all of these changes would mean that compile can now get an iterable and not know whether it's a file or a token stream until it tries to iterate it. So maybe that isn't the best API; maybe it's better to explicitly call tokenize, then ast.parse, then compile instead of calling compile repeatedly with different flags.
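For concreteness, here is a minimal sketch of that workaround as it exists today (compile_via_tokens and the pass-through munge are made-up names; a real hook would actually rewrite tokens in munge):

    import tokenize
    from io import BytesIO

    def munge(tokens):
        # Placeholder transform: pass every token through unchanged.
        for tok in tokens:
            yield tok

    def compile_via_tokens(source, filename='<hooked>'):
        # str -> bytes -> token stream -> munged stream -> source again -> code.
        readline = BytesIO(source.encode('utf-8')).readline
        tokens = tokenize.tokenize(readline)             # yields an ENCODING token first
        new_source = tokenize.untokenize(munge(tokens))  # bytes, encoded per that token
        return compile(new_source, filename, 'exec')

As soon as munge() actually inserts or drops tokens, the positions in the untokenized text drift away from the original source, which is exactly the line/column problem described above.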

Actually CPython has another step between the AST and the bytecode, which validates the AST to block out trees that violate various rules that were not easily incorporated into the LL(1) grammar. This means that when you want to change parsing, you have to change: the grammar, the AST library, the validation library, and Python's exposed parsing module.

Modern parsers do not separate the grammar from tokenizing, parsing, and validation. All of these are done in one place, which not only simplifies changes to the grammar, but also protects you from possible inconsistencies. It was really hard for me when I was making changes to the parser to keep my conception of these four things synchronized.

So in my opinion, if you're going to modernize the parsing, then put it all together into one simple library that deals with all of it. It seems like what you're suggesting would add complexity, whereas a merged solution would simplify the code. If it's hard to write a fast parser, then consider writing a parser generator in Python that generates the C code you want.

Best,

Neil

On Friday, June 5, 2015 at 5:30:23 AM UTC-4, Andrew Barnert via Python-ideas wrote:

On Fri, Jun 5, 2015 at 5:38 PM, Neil Girdhar <mistersheik@gmail.com> wrote:
Hi, Neil, thanks for that! Having studied only ancient parsers, I'd love to learn new ones. Can you please post references to modern parsing? Actual parsers, books, papers, anything you may find valuable. I have a hunch you're talking about PEG parsers, but maybe something else, or besides? Thanks! Best, Luciano -- Luciano Ramalho | Author of Fluent Python (O'Reilly, 2015) | http://shop.oreilly.com/product/0636920032519.do | Professor em: http://python.pro.br | Twitter: @ramalhoorg

Back in the day, I remember Lex and Yacc, then came Flex and Bison, and then ANTLR, which unified lexing and parsing under one common language. In general, I like the idea of putting everything together. I think that because of Python's separation of lexing and parsing, it accepts weird text like "(1if 0else 2)", which is crazy.

Here's what I think I want in a parser: Along with the grammar, you also give it code that it can execute as it matches each symbol in a rule. In Python for example, as it matches each argument passed to a function, it would keep track of the count of *args, **kwargs, and keyword arguments, and regular arguments, and then raise a syntax error if it encounters anything out of order. Right now that check is done in validate.c, which is really annoying.

I want to specify the lexical rules in the same way that I specify the parsing rules. And I think (after Andrew elucidates what he means by hooks) I want the parsing hooks to be the same thing as lexing hooks, and I agree with him that hooking into the lexer is useful.

I want the parser module to be automatically generated from the grammar if that's possible (I think it is). Typically each grammar rule is implemented using a class. I want the code generation to be a method on that class. This makes changing the AST easy. For example, it was suggested that we might change the grammar to include a starstar_expr node. This should be an easy change, but because of the way every node validates its children, which it expects to have a certain tree structure, it would be a big task with almost no payoff.

There's also a question of which parsing algorithm you use. I wish I knew more about state-of-the-art parsers. I was interested because I wanted to use Python to parse my LaTeX files. I got the impression that Earley parsers (https://en.wikipedia.org/wiki/Earley_parser) were state of the art, but I'm not sure.

I'm curious what other people will contribute to this discussion, as I think having no great parsing library is a huge hole in Python. Having one would definitely allow me to write better utilities using Python.

On Fri, Jun 5, 2015 at 6:55 PM, Luciano Ramalho <luciano@ramalho.org> wrote:
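To make the "(1if 0else 2)" point concrete, here is a tiny sketch (run against the pure-Python tokenize module of the 3.4/3.5 era under discussion) showing that the lexer happily splits "1if" into a NUMBER and a NAME, so the parser never sees anything unusual:

    import io
    import tokenize

    source = "(1if 0else 2)\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
    # OP '(', NUMBER '1', NAME 'if', NUMBER '0', NAME 'else', NUMBER '2',
    # OP ')', NEWLINE, ENDMARKER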

IMO, lexer and parser separation is sometimes great. It also makes hand-written parsers much simpler. "Modern" parsing with no lexer and EBNF can sometimes be slower than the classics, especially if one is using an ultra-fast lexer generator such as re2c. On June 5, 2015 9:21:08 PM CDT, Neil Girdhar <mistersheik@gmail.com> wrote:
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

I don't see why it makes anything simpler. Your lexing rules just live alongside your parsing rules. And I also don't see why it has to be faster to do the lexing in a separate part of the code. Wouldn't the parser generator realize that some of the rules don't use the stack and so they would end up just as fast as any lexer? On Fri, Jun 5, 2015 at 10:55 PM, Ryan Gonzalez <rymg19@gmail.com> wrote:

On Fri, Jun 5, 2015 at 7:57 PM, Neil Girdhar <mistersheik@gmail.com> wrote:
You're putting a lot of faith in "modern" parsers. I don't know if PLY qualifies as such, but it certainly is newer than Lex/Yacc, and it unifies the lexer and parser. However I don't think it would be much better for a language the size of Python. We are using PLY at Dropbox to parse a medium-sized DSL, and while at the beginning it was convenient to have the entire language definition in one place, there were a fair number of subtle bugs in the earlier stages of the project due to the mixing of lexing and parsing. In order to get this right it seems you actually have to *think* about the lexing and parsing stages differently, and combining them in one tool doesn't actually help you to think more clearly. Also, this approach doesn't really do much for the later stages -- you can easily construct a parse tree but it's a fairly direct representation of the grammar rules, and it offers no help in managing a symbol table or generating code. -- --Guido van Rossum (python.org/~guido)

On Sat, Jun 6, 2015 at 12:27 AM, Guido van Rossum <guido@python.org> wrote:
I agree with you. I think the problem might be that the parser that I'm dreaming of doesn't exist for Python. In another message, I wrote what I wanted:

— Along with the grammar, you also give it code that it can execute as it matches each symbol in a rule. In Python for example, as it matches each argument passed to a function, it would keep track of the count of *args, **kwargs, and keyword arguments, and regular arguments, and then raise a syntax error if it encounters anything out of order. Right now that check is done in validate.c, which is really annoying. I want to specify the lexical rules in the same way that I specify the parsing rules. And I think (after Andrew elucidates what he means by hooks) I want the parsing hooks to be the same thing as lexing hooks, and I agree with him that hooking into the lexer is useful. I want the parser module to be automatically-generated from the grammar if that's possible (I think it is). Typically each grammar rule is implemented using a class. I want the code generation to be a method on that class. This makes changing the AST easy. For example, it was suggested that we might change the grammar to include a starstar_expr node. This should be an easy change, but because of the way every node validates its children, which it expects to have a certain tree structure, it would be a big task with almost no payoff. —

I don't think this is possible with Ply.
That's interesting. I can understand wanting to separate them mentally, but two problems with separating at a fundamental programmatic level are: (1) you may want to change a lexical token like number to — in some cases — be LL(1) for who knows what reason; or (2) you would have to implement lexical hooks differently than parsing hooks. In some of Andrew's code below, the tokenize hook looks so different from the parser hook, and I think that's unfortunate.
It would be nice to generate the code in methods on the classes that implement the grammar rules. This would allow you to use memos that were filled in as you were parsing and validating to generate code.
-- --Guido van Rossum (python.org/~guido)

On Fri, Jun 5, 2015 at 9:27 PM, Guido van Rossum <guido@python.org> wrote:
PLY doesn't really "unify" the lexer and parser; it just provides both of them in the same Python package (and uses somewhat similar syntax and conventions for each).

I wrote a project at my last consulting position to process a fairly complex DSL (used for code generation to several targets, Python, C++, Verilog, etc.). I like PLY, and decided to use that tool; but after a short while I gave up on the parser part of it, and only used the lexing, leaving parsing to "hand rolled" code. I'm sure I *could* have managed to shoehorn the entire EBNF stuff into the parsing component of PLY. But for my own purpose, I found it more important to do various simplifications and modifications of the token stream before generating the data structures that defined the eventual output parameters. So in this respect, what I did is something like a simpler version of Python's compilation pipeline.

Actually, what I did was probably terrible practice for parsing purists, but felt to me like the best "practicality beats purity" approach. There were these finite number of constructs in the DSL, and I would simply scan through the token stream, in several passes, trying to identify a particular construct, then pulling it out into the relevant data structure type, and just marking those tokens as "used". Other passes would look for other constructs, and in some cases I'd need to resolve a reference to one kind of construct that wasn't generated until a later pass in a "unification" step. There was a bit of duct tape and baling wire involved in all of this, but it actually seemed to keep the code as simple as possible by isolating the code to generate each type of construct.

None of which is actually relevant to what Python should do in its parsing, just a little bit of rambling thoughts.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
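As a toy illustration of that multi-pass, mark-as-used approach (the Tok class, the token kinds, and the "assignment" construct are all invented for the example, not the actual DSL code):

    class Tok:
        def __init__(self, kind, value):
            self.kind, self.value, self.used = kind, value, False

    def pass_assignments(tokens, constructs):
        # One pass: find NAME '=' NUMBER, record it, and mark those tokens
        # as used so that later passes skip them.
        for i in range(len(tokens) - 2):
            a, b, c = tokens[i:i + 3]
            if (not (a.used or b.used or c.used)
                    and a.kind == 'NAME' and b.value == '=' and c.kind == 'NUMBER'):
                constructs.append(('assign', a.value, c.value))
                a.used = b.used = c.used = True

    toks = [Tok('NAME', 'x'), Tok('OP', '='), Tok('NUMBER', '3'), Tok('OP', ';')]
    found = []
    pass_assignments(toks, found)
    print(found)                                   # [('assign', 'x', '3')]
    print([t.value for t in toks if not t.used])   # [';'] is left for later passes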

On Fri, Jun 5, 2015, at 22:21, Neil Girdhar wrote:
I don't think this really has anything to do with separation of lexing and parsing. C rejects this (where "this" is "integer followed by arbitrary alphabetic token") purely due to the lexing stage (specifically, 1if or 0else would be a single "preprocessor number" token, with no valid meaning. Of course, this has its own quirks, for example 0xE+1 is invalid in C.)

On 6 June 2015 at 12:21, Neil Girdhar <mistersheik@gmail.com> wrote:
The design of *Python's* grammar is deliberately restricted to being parsable with an LL(1) parser. There are a great many static analysis and syntax highlighting tools that are able to take advantage of that simplicity because they only care about the syntax, not the full semantics. Anyone actually doing their *own* parsing of something else *in* Python, would be better advised to reach for PLY (https://pypi.python.org/pypi/ply ). PLY is the parser underlying https://pypi.python.org/pypi/pycparser, and hence the highly regarded CFFI library, https://pypi.python.org/pypi/cffi Other notable parsing alternatives folks may want to look at include https://pypi.python.org/pypi/lrparsing and http://pythonhosted.org/pyparsing/ (both of which allow you to use Python code to define your grammar, rather than having to learn a formal grammar notation). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 6 June 2015 at 15:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
For the later stages of the pipeline (i.e. AST -> code generation), CPython now uses Eli Bendersky's asdl_parser: https://github.com/eliben/asdl_parser More background on that: http://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compi... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Jun 6, 2015 at 1:00 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Given the validation that happens, it's not actually LL(1) though. It's mostly LL(1) with some syntax errors that are raised for various illegal constructs. Anyway, no one is suggesting changing the grammar.
I looked at ply and pyparsing, but it was impossible to simply parse LaTeX because I couldn't explain to them how to suck up the right number of arguments given the name of the function. When it sees a function, it learns how many arguments that function needs. When it sees a function call \a{1}{2}{3}, if "\a" takes 2 arguments, then it should only suck up {1} and {2} as arguments, and leave {3} as a regular text token. In other words, I should be able to tell the parser what to expect in code that lives on the rule edges.

The parsing tools you listed work really well until you need to do something like (1) the validation step that happens in Python, or (2) figuring out exactly where the syntax error is (line and column number), or (3) ensuring that whitespace separates some tokens even when it's not required to disambiguate different parse trees. I got the impression that they wanted to make these languages simple for the simple cases, but they were made too simple and don't allow you to do everything in one simple pass.

Best,

Neil

On June 6, 2015 12:29:21 AM CDT, Neil Girdhar <mistersheik@gmail.com> wrote:
Can't you just hack it into the lexer? When the slash is detected, the lexer can treat the following identifier as a function, look up the number of required arguments, and push it onto some sort of stack. Whenever a left bracket is encountered and another argument is needed by the TOS, it returns a special argument opener token.
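A toy sketch of that idea (the ARITY table and the token names are invented; a real implementation would also pop the stack when an argument group closes, so nested macros aren't handled here):

    import re

    ARITY = {'a': 2}   # hypothetical: \a takes two arguments

    def lex(text):
        # Remember how many arguments the most recent macro still needs, and
        # emit a special token for the braces that open those arguments.
        pending = []          # stack of remaining-argument counts
        for m in re.finditer(r'\\([A-Za-z]+)|([{}])|([^\\{}]+)', text):
            name, brace, run = m.groups()
            if name:
                pending.append(ARITY.get(name, 0))
                yield ('MACRO', '\\' + name)
            elif brace == '{' and pending and pending[-1] > 0:
                pending[-1] -= 1
                yield ('ARG_OPEN', brace)
            elif brace:
                yield ('TEXT', brace)
            else:
                yield ('TEXT', run)

    print(list(lex(r'\a{1}{2}{3}')))
    # The first two '{' come out as ARG_OPEN; the third '{' is plain TEXT.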
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

On 6 June 2015 at 17:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
Let me just pimp https://pypi.python.org/pypi/Parsley here - I have written languages in both Parsley (a simple packaging metadata language) and its predecessor pymeta (in which I wrote pybars - handlebars.js for python) - and both were good implementations of OMeta, IMNSHO. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Converged Cloud

The best parsing library in Python I could find, to my eyes, is modgrammar: https://pythonhosted.org/modgrammar/ It's GLR, I think. The documentation isn't bad and the syntax isn't too bad.

The major change that I want to make to it is to replace the grammar class variables with regular instance generator methods, and to replace the components of the grammar return value, which are currently classes, with constructed objects. That way, a whitespace object that represents a block continuation can be constructed to know how much whitespace it must match. Similarly, a "suite" can include a constructed whitespace object that includes extra space. After it's matched, it can be queried for its size, and the grammar generator method can construct whitespace objects with the appropriate size. This eliminates the need for INDENT and DEDENT tokens. This kind of dynamic grammar generation is desirable for all kinds of other language-related problems, like the LaTeX one I discussed, and it also allows us to merge all of the validation code into the parsing code, which follows "Don't Repeat Yourself". I think it's a better design. I will try to find time to build a demo of this idea this week.

Ultimately, my problem with "token transformers" is, if I'm understanding correctly, that we want to change Python so that not only will 3.5 have token transformers, but every Python after that has to support them. This risks constraining the development of the elegant solution. And for what major reason do we even need token transformers so soon? For a toy example on python-ideas about automatic Decimal instances? Why can't a user define a one-character function "d(x)" to do the conversion everywhere? I prefer to push for the better design even if it means waiting a year.

Best,

Neil

On Sun, Jun 7, 2015 at 6:19 PM, Robert Collins <robertc@robertcollins.net> wrote:

On 8 June 2015 at 12:37, Neil Girdhar <mistersheik@gmail.com> wrote:
Neil, you're the only one proposing major structural changes to the code generation pipeline, and nobody at all is proposing anything for Python 3.5 (the feature freeze deadline for that has already passed). Andrew is essentially only proposing relatively minor tweaks to the API of the existing tokenizer module to make it more iterable based and less file based (while still preserving the file based APIs). Eugene Toder's and Dave Malcolm's patches from a few years ago make the existing AST -> bytecode section of the toolchain easier to modify and experiment with (and are ideas worth exploring for 3.6 if anyone is willing and able to invest the time to bring them back up to date). However, if you're specifically wanting to work on an "ideal parser API", then the reference interpreter for a 24 year old established language *isn't* the place to do it - the compromises necessitated by the need to align with an extensive existing ecosystem will actively work against your goal for a clean, minimalist structure. That's thus something better developed *independently* of CPython, and then potentially considered at some point in the future when it's better established what concrete benefits it would offer over the status quo for both the core development team and Python's end users. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jun 7, 2015 at 10:52 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
That's not what I'm doing. All I'm suggesting is that changes to Python that *preclude* the "ideal parser API" be avoided. I'm not trying to make the ideal API happen today. I'm just keeping the path to that rosy future free of obstacles.

On 8 June 2015 at 14:56, Neil Girdhar <mistersheik@gmail.com> wrote:
On Sun, Jun 7, 2015 at 10:52 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I've used that approach in projects before, and in hindsight I realise that I caused significant disruption doing that. The reason boils down to: without consensus that the rosy future is all of:

- the right future
- worth doing eventually
- more important to reach than solving problems that appear on the way

then you end up frustrating folk that have problems now, without actually adding value to anyone: the project gets to choose between a future that [worst case, fails all three tests] might not be right, might not be worth doing, and is less important than actual problems which it is stopping solutions for.

In this particular case, given Nick's comments about why we change the guts here, I'd say that 'worth doing eventually' is not in consensus, and I personally think that anything that is 'in the indefinite future' is almost never more important than problems affecting people today, because it's a possible future benefit vs a current cost. There's probably an economics theorem to describe that, but I'm not an economist :)

Pragmatically, I propose that the existing structure already has significant friction around any move to a unified (but still multi-pass, I presume) parser infrastructure, and so adding a small amount of friction for substantial benefits will not substantially impact the future work. Concretely: a multi-stage parser with a unified language for both lexer and parser should be quite amenable to calling out to a legacy token hook, without onerous impact. Failing that, we can follow the deprecation approach when someone finds we can't do that, and after a reasonable time remove the old hook. But right now, I think the onus is on you to show that a shim wouldn't be possible, rather than refusing to support adding a tokeniser hook because a shim isn't *obviously possible*.

-Rob

-- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Converged Cloud

On 8 June 2015 at 13:10, Robert Collins <robertc@robertcollins.net> wrote:
I don't know about economics, but for anyone that hasn't encountered it before, the phrase YAGNI is a good one to know: You Ain't Gonna Need It. ( http://c2.com/cgi/wiki?YouArentGonnaNeedIt )

The way YAGNI applies when deciding *to* do something is when you're faced with the following choice:

* Making a particular change solves an immediate problem, but would make another possible change more complex in the future
* Not making a change preserves the simplicity of the possible future change, but also doesn't solve the immediate problem

Sometimes you'll get lucky and someone will figure out a third path that both addresses the immediate concern *and* leaves your future options open for other changes. More often though, you'll have to decide between these options, and in those cases "YAGNI" argues in favour of heavily discounting the potential increase in difficulty for a change you may never make anyway.

Cheers, Nick.

P.S. This tension between considering the long term implications of changes without allowing that consideration to block near term progress is what I personally see in the following two lines of the Zen of Python:

Now is better than never.
Although never is often better than *right* now.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Yes, but in this case the near term "problem" was as far as I can tell just parsing floats as decimals, which is easily done with a somewhat noisy function call. I don't see why it's important. The way that CPython does parsing is more than just annoying. It's a mess of repetition and tests that try to make sure that all of the phases are synchronized. I don't think that CPython is the future of Python. One day, someone will write a Python interpreter in Python that includes a clean one-pass parser. I would prefer to make that as easy to realize as possible. You might think it's far-fetched. I don't think it is. Best, Neil On Mon, Jun 8, 2015 at 12:18 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

On Jun 7, 2015, at 21:23, Neil Girdhar <mistersheik@gmail.com> wrote:
Yes, but in this case the near term "problem" was as far as I can tell just parsing floats as decimals, which is easily done with a somewhat noisy function call. I don't see why it's important.
This isn't the only case anyone's ever wanted. The tokenize module has been there since at least 1.5, and presumably it wasn't added for no good reason, or made to work with 3.x just for fun. And it has an example use in the docs. The only thing that's changed is that, now that postprocessing the AST has become a lot easier and less hacky because of the ast module and the succession of changes to the import process, the fact that tokenize is still clumsy and hacky is more noticeable.

On 8 June 2015 at 14:23, Neil Girdhar <mistersheik@gmail.com> wrote:
No, the problem to be solved is making it easier for people to "play" with Python's syntax and try out different ideas in a format that can be shared easily. The more people that are able to tinker and play with something, and share the results of their work, the more opportunities there are for good ideas to be had, and shared, eventually building up to the point of becoming a coherent proposal for change. The 3.4 dis module included several enhancements to make playing with bytecode easier and more fun: https://docs.python.org/3/whatsnew/3.4.html#dis 3.4 also added the source_to_code() hook in importlib to make it easy to tweak the compilation pass without having to learn all the other intricacies of the import system: https://docs.python.org/3/whatsnew/3.4.html#importlib MacroPy and Hylang are interesting examples of ways to manipulate the AST in order to use the CPython VM without relying solely on the native language syntax, while byteplay and Numba are examples of manipulating things at the bytecode level.
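For readers who haven't tried it, a minimal sketch of wiring a custom source_to_code() into the import system looks roughly like this (TweakingLoader and install are made-up names, the AST transform is left as a no-op placeholder, and bytecode caching may skip the transform for modules that already have a valid .pyc):

    import ast
    import importlib.machinery
    import importlib.util
    import sys

    class TweakingLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            tree = ast.parse(importlib.util.decode_source(data), path)
            # ... transform `tree` here ...
            return compile(tree, path, 'exec', dont_inherit=True,
                           optimize=_optimize)

    def install():
        # Use TweakingLoader for ordinary .py files, ahead of the default hook.
        details = (TweakingLoader, importlib.machinery.SOURCE_SUFFIXES)
        sys.path_hooks.insert(0,
                              importlib.machinery.FileFinder.path_hook(details))
        sys.path_importer_cache.clear()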
While the structure of CPython's code generation toolchain certainly poses high incidental barriers to entry, those barriers are trivial compared to the *inherent* barriers to entry involved in successfully making the case for a change like introducing a matrix multiplication operator or more clearly separating coroutines from generators through the async/await keywords (both matrix multiplication support and async/await landed for 3.5). If someone successfully makes the case for a compelling change to the language specification, then existing core developers are also ready, willing and able to assist in actually making the change to CPython. As a result, making that final step of *implementing* a syntactic change in CPython easier involves changing something that *isn't the bottleneck in the process*, so it would have no meaningful impact on the broader Python community. By contrast, making *more steps* of the existing compilation process easier for pure Python programmers to play with, preferably in an implementation independent way, *does* impact two of the bottlenecks: the implementation of syntactic ideas in executable form, and sharing those experiments with others. Combining that level of syntactic play with PyPy's ability to automatically generate JIT compilers offers an extraordinarily powerful platform for experimentation, with the standardisation process ensuring that accepted experiments also scale down to significantly more constrained environments like MicroPython. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Robert Collins writes:
There's probably an economics theorem to describe that, but I'm not an economist :)
I'd like to refuse the troll, but this one is too good to pass up. The book which is the authoritative source on the theorems you're looking for is Nancy Stokey's "The Economics of Inaction". 'Nuff said on this topic?<wink/>

On Jun 7, 2015, at 19:52, Nick Coghlan <ncoghlan@gmail.com> wrote:
And also a patch to the existing ast module to allow it to handle tokenizers from Python as well as from C. The tokenizer tweaks themselves are just to make that easier (and to make using tokenizer a little simpler even if you don't feed it directly to the parser). (It surprised me that the C-level tokenizer actually can take C strings and string objects rather than file objects, but once you think about how the high-level C API stuff like being able to exec a single line must work, it's pretty obvious why that was added...)
I got a chance to take a look at this, and, while it seems completely orthogonal to what I'm trying to do, it also seems very cool. If someone got the patches up to date for the trunk and fixed the minor issues involved in the last review (both of which look pretty simple), what are the chances of getting it reviewed for 3.6? (I realize this is probably a better question for the issue tracker or the -dev list than buried in the middle of a barely-relevant -ideas thread, but I'm on my phone here, and you brought it up.:)

On 8 June 2015 at 13:18, Andrew Barnert <abarnert@yahoo.com> wrote:
I'm still interested in the underlying ideas, and while it's always possible to get completely surprised by life's twists and turns, I'm optimistic I'd be able to find the time to provide feedback on it myself for 3.6, and hopefully encourage other folks with experience with the compiler internals to review it as well. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Neil Girdhar <mistersheik@gmail.com> wrote:

> Along with the grammar, you also give it code that it can execute as it matches each symbol in a rule. In Python for example, as it matches each argument passed to a function, it would keep track of the count of *args, **kwargs, and keyword arguments, and regular arguments, and then raise a syntax error if it encounters anything out of order. Right now that check is done in validate.c, which is really annoying.

Agreed. For 3.4 it was possible to encode these particular semantics into the grammar itself, but it would no longer be LL(1).

If I understood correctly, you wanted to handle lexing and parsing together. How would the INDENT/DEDENT tokens be generated? For my private ast generator, I did the opposite: I wanted to formalize the token preprocessing step, so I have:

    lexer -> parser1 (generates INDENT/DEDENT) -> parser2 (generates the ast directly)

It isn't slower than what is in Python right now and you can hook into the token stream at any place.

Stefan Krah
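A toy version of that intermediate INDENT/DEDENT pass, working on raw source lines rather than real tokens (tabs, continuation lines, and inconsistent dedents are ignored to keep the sketch short):

    def with_indent_tokens(lines):
        stack = [0]                     # indentation widths currently open
        for line in lines:
            stripped = line.lstrip(' ')
            if not stripped.strip() or stripped.startswith('#'):
                continue                # skip blank and comment-only lines
            width = len(line) - len(stripped)
            if width > stack[-1]:
                stack.append(width)
                yield ('INDENT', '')
            while width < stack[-1]:
                stack.pop()
                yield ('DEDENT', '')
            yield ('LINE', stripped.rstrip('\n'))
        while len(stack) > 1:           # close anything still open at EOF
            stack.pop()
            yield ('DEDENT', '')

    src = ["def f(x):\n", "    if x:\n", "        return x\n", "    return 0\n"]
    for tok in with_indent_tokens(src):
        print(tok)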

Maybe if every production has a link to its parent, then the spaces after a newline followed by statement reduce to indentation followed by statement, which reduces to indent or dedent or nothing followed by statement based on the parent's indentation level? In other words the parent (a file_input e.g.) has active control of the grammar of its children? On Sat, Jun 6, 2015 at 9:36 AM, s.krah <stefan@bytereef.org> wrote:

Ryan: I'm trying to figure out how the parsing library should be done — not trying to work around other designs.

Stefan: maybe this is a better answer to your question. So thinking about this more, this is how I think it should be done: Each grammar rule is expressed as an Iterable.

    class FileInput:
        def __init__(self):
            self.indent_level = None

        def match(self):
            while True:
                matched = yield Disjunction(
                    '\n',
                    [Whitespace(self.indent_level, indent=False), Statement()])
                if matched == '\n':
                    break
            yield EndOfFile()

    class Suite:
        def __init__(self, indent_level):
            self.indent_level = indent_level

        def match(self):
            yield Disjunction(
                SimpleStatement(),
                ['\n', Whitespace(self.indent_level, indent=True),
                 Repeat(Statement)])
            # dedent is not required because the next statement knows its
            # indent level.

On Sat, Jun 6, 2015 at 9:36 AM, s.krah <stefan@bytereef.org> wrote:

On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik@gmail.com> wrote:
Actually CPython has another step between the AST and the bytecode, which validates the AST to block out trees that violate various rules that were not easily incorporated into the LL(1) grammar.
Yes, and it also builds CST nodes before building the AST, and there's a step after AST validation and before bytecode generation where the symbol table for the scope is built and LEG rules are applied. But none of those are things that seem particularly useful to hook. Maybe that's just a failure of imagination, but I've never wanted to do it.

Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook because an import hook can't get access to the token stream.

And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; it would be a whole lot harder without that, while it would be a little easier and a whole lot cleaner if I could just hook the token stream.
This means that when you want to change parsing, you have to change: the grammar, the AST library, the validation library, and Python's exposed parsing module.
And the code generation and other post-validation steps (unless you're just trying to create a do-nothing construct--which can be useful to give you something new to, e.g., feed to MacroPy, but it's not something you're going to check in to core, or use in your own production code). So yes, changing the grammar is painful. Which is just one more reason that being able to hack on Python without having to hack on Python is very useful. And as of 3.4, all of the pieces are there to do that, and dead-easy to use, and robust enough for production code--as long as the level you want to hack on is source text, AST, or bytecode, not token stream.
Modern parsers do not separate the grammar from tokenizing, parsing, and validation. All of these are done in one place, which not only simplifies changes to the grammar, but also protects you from possible inconsistencies. It was really hard for me when I was making changes to the parser to keep my conception of these four things synchronized.
So in my opinion, if you're going to modernize the parsing, then put it all together into one simple library that deals with all of it. It seems like what you're suggesting would add complexity, whereas a merged solution would simplify the code.
Rewriting the entire parsing mechanism from scratch might simplify things, but it also means rewriting the entire parsing mechanism from scratch. I'm sure you could implement a GLR parser generator that takes a complete declarative grammar and generates something that goes right from source code to a SAX- or iterparse-style pre-validated AST, and that would be a really cool thing. But besides being a lot of work, it would also be a huge amount of risk. You'd almost certainly end up with new bugs, new places where syntax errors are harder to diagnose, new places where compiling is slower than it used to be, etc.

Also, Python is defined as having a separate lexical analysis phase, and a module named tokenize that does the same thing as this phase, and tests that test it, and so on. So, you can't just throw all that out and remain backward compatible.

Meanwhile, adding functions to create a token state struct out of a Python iterable, drive it, and expose that functionality to Python is a lot less work, very unlikely to have any effect on the existing default mechanism (if you don't hook the token stream, the existing code runs the same as today, except for an if check inside the next-token function), and much easier to make exactly compatible with existing behavior even when you do hook the token stream (if creating and driving the token state works, all the other code is the same as it ever was). And it has no backward compat implications.
If it's hard to write a fast parser, then consider writing a parser generator in Python that generates the C code you want.
It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so one that does the exact same thing as whatever version of Python you're running under's parser).

On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
Yes, I think I understand your motivation. Can you help me understand the what the hook you would write would look like?
Yes.
I don't see why that is. The lexical "phase" would just become new parsing rules, and so it would be supplanted by the parser. Then you wouldn't need to add special hooks for lexing. You would merely have hooks for parsing.
I think it's worth exploring having the whole parser in one place, rather than repeating the same structures in at least four places. With every change to the Python grammar, you pay for this forced repetition.

First, I think your idea is almost completely tangential to mine. Yes, if you completely replaced both the interface and the implementation of the parser, you could do just about anything you wanted. But assuming nobody is going to completely replace the way Python does parsing today, I think it's still useful to add the one missing useful hook to the existing system. But let's continue. On Friday, June 5, 2015 7:08 PM, Neil Girdhar <mistersheik@gmail.com> wrote: On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik@gmail.com> wrote:
Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook because an import hook can't get access to the token stream.
And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; it would be a whole lot harder without that, while it would be a little easier and a whole lot cleaner if I could just hook the token stream.
Yes, I think I understand your motivation. Can you help me understand the what the hook you would write would look like?
That's a fair question.

First, let's look at the relevant parts of an AST hook that transforms every float literal into a Decimal constructor call:

    class FloatNodeWrapper(ast.NodeTransformer):
        def visit_Num(self, node):
            if isinstance(node.n, float):
                return ast.Call(func=ast.Name(id='Decimal', ctx=ast.Load()),
                                args=[ast.Str(s=repr(node.n))], keywords=[])
            return node

    # ...

    def source_to_code(self, data, path, *, _optimize=-1):
        source = importlib.util.decode_source(data)
        tree = ast.parse(source)
        tree = FloatNodeWrapper().visit(tree)
        ast.fix_missing_locations(tree)
        return compile(tree, path, 'exec', dont_inherit=True, optimize=_optimize)

Now, here's what I'd like to write for a token hook that does the same thing at the token level:

    def retokenize_float(tokens):
        for num, val, *loc in tokens:
            if num == tokenize.NUMBER and ('.' in val or 'e' in val or 'E' in val):
                yield tokenize.NAME, 'Decimal', *loc
                yield tokenize.OP, '(', *loc
                yield tokenize.STRING, repr(val), *loc
                yield tokenize.OP, ')', *loc
            else:
                yield num, val, *loc

    # ...

    def source_to_code(self, data, path, *, _optimize=-1):
        source = importlib.util.decode_source(data)
        tokens = tokenize.tokenize(source)
        tokens = retokenize_float(tokens)
        return compile(tokens, path, 'exec', dont_inherit=True, optimize=_optimize)

Of course I don't want to do the same thing, I want to do something that you can't do at the AST level—see my user literal hack for an example. But this shows the parallels and differences between the two.

If you want more background, see http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking... (which I wrote to explain to someone else how floatliteralhack works).

Of course I'm not presenting this as an ideal design if I were starting Python from scratch, but as the best design given what Python is already producing and consuming (a stream of tokens that's guaranteed to be equivalent to what you get out of the tokenize module).
Also, Python is defined as hacking a separate lexical analysis phase, and a module named tokenize that does the same thing as this phase, and tests that test it, and so on. So, you can't just throw all that out and remain backward compatible.
I don't see why that is. The lexical "phase" would just become new parsing rules, and so it would be supplanted by the parser. Then you wouldn't need to add special hooks for lexing. You would merely have hooks for parsing.
Callbacks from a tree builder and using a user-modifiable grammar are clearly not backward compatible with ast.NodeTransformer. They're a completely different way of doing things. Is it a better way? Maybe. Plenty of people are using OMeta/JS every day. Of course plenty of people are cursing the fact that OMeta/JS sometimes generates exponential-time backtracking and it's never clear which part of your production rules are to blame, or the fact that you can't get a useful error message out of it, etc. And I'm pretty sure you could design something with most of the strengths of OMeta without its weaknesses (just using a standard packrat PEG parser instead of an extended PEG parser seems like it would turn most of the exponential productions into explicit errors in the grammar…). Or you could go the opposite way and use GLR and bottom-up callbacks instead of PEG and top-down. Something like that would be a great idea for a research project. But it's probably not a great idea for a proposal to change the core of CPython, at least until someone does that research project.
It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so one that does the exact same thing as whatever version of Python you're running under's parser).
I think it's worth exploring having the whole parser in one place, rather than repeating the same structures in at least four places. With every change to the Python grammar, you pay for this forced repetition.
The repetition is really a different issue. A different implementation of the same basic design Python already has could make it so you only have to write explicit code for the 3 places the CST->AST node doesn't follow the same rules as everywhere else and the dozen or so places where the AST has to be post-validated, instead of having to write explicit code for both sides of every single node type. And that kind of cleanup could be done without breaking backward compatibility, because the interfaces on each side of the code would be unchanged. But that's also a lot less fun of a change than writing a whole new parser, so I wouldn't be surprised if nobody ever did it…

On Sat, Jun 6, 2015 at 1:30 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
Yes. I want to point out that if the lexer rules were alongside the parser, they would be generating ast nodes — so the hook for calling Decimal for all floating point tokens would be doable in the same way as your AST hook. For the new tokens that you want, the ideal solution I think is to modify the python parsing grammar before it parses the text.
This is like saying "I want to change some things, but not other things". I want the best long-term solution, whatever that is. (I don't know what it is.) In the long run, moving towards the best solution tends to be the least total work. Specifically, if lexing hooks are implemented differently than parsing hooks, then that change is probably going to be backed out and replaced with the "ideal" — eventually — maybe in five years or ten or twenty. And when it's removed, there are deprecation periods and upgrade pains. At the least, let's explore what the "ideal" solution is.
I don't know about OMeta, but the Earley parsing algorithm is worst-case cubic time, "quadratic time for unambiguous grammars, and linear time for almost all LR(k) grammars".
Yes, totally agree with you. So if it were me doing this work, I would put my energy in the research project to write an amazing parser in Python. And then I would try to convince the Python team to use that. I guess we don't disagree at all.
Cool, I didn't know it was even possible.

No. The way Python currently exposes things, the AST hook runs on an already-generated AST and transforms it into another one, to hand off to the code generator. That means it can only be used to handle things that parse as legal Python syntax (unless you replace the entire parser). What I want is a way to similarly take an already-generated token stream and transform it into another one, to hand off to the parser. That will allow it to be used to handle things that lex as legal Python tokens but don't parse as legal Python syntax, like what Paul suggested. Merging lexing into parsing not only doesn't give me that, it makes that impossible.
For the new tokens that you want, the ideal solution I think is to modify the python parsing grammar before it parses the text.
But I don't want any new tokens. I just want to change the way existing tokens are interpreted. Just as with an AST hook like PyMacro, I don't want any new nodes, I just want to change the way existing nodes are interpreted.
Of course I'm not presenting this as an ideal design if I were starting Python from scratch, but as the best design given what Python is already producing and consuming (a stream of tokens that's guaranteed to be equivalent to what you get out of the tokenize module).
This is like saying "I want to change some things, but not other things".
That's exactly what I'm saying. In particular, I want to change as few things as possible, to get what I want, without breaking stuff that already demonstrably works and has worked for decades.
I don't know about OMeta, but the Earley parsing algorithm is worst-case cubic time, "quadratic time for unambiguous grammars, and linear time for almost all LR(k) grammars".
I don't know why you'd want to use Earley for parsing a programming language. IIRC, it was the first algorithm that could handle rampant ambiguity in polynomial time, but that isn't relevant to parsing programming languages (especially one like Python, which was explicitly designed to be simple to parse), and it isn't relevant to natural languages if you're not still in the 1960s, except in learning the theory and history of parsing. GLR does much better in almost-unambiguous/almost-deterministic languages; CYK can be easily extended with weights (which propagate sensibly, so you can use them for a final judgment, or to heuristically prune alternatives as you go); Valiant is easier to reason about mathematically; etc. And that's just among the parsers in the same basic family as Earley. Also, the point of OMeta is that it's not just a parsing algorithm, it's a complete system that's been designed and built and is being used to write DSLs, macros, and other language extensions in real-life code in languages like JavaScript and C#. So you don't have to imagine what kind of interface you could present or what it might be like to use it in practice, you can use it and find out. And I think it's in the same basic direction as the kind of interface you want for Python's parser.
And I'm pretty sure you could design something with most of the strengths of OMeta without its weaknesses (just using a standard packrat PEG parser instead of an extended PEG parser seems like it would turn most of the exponential productions into explicit errors in the grammar…). Or you could go the opposite way and use GLR and bottom-up callbacks instead of PEG and top-down. Something like that would be a great idea for a research project. But it's probably not a great idea for a proposal to change the core of CPython, at least until someone does that research project.
Yes, totally agree with you. So if it were me doing this work, I would put my energy in the research project to write an amazing parser in Python. And then I would try to convince the Python team to use that. I guess we don't disagree at all.
Well, I think we disagree about the value of our time, and about the cost of disruptive changes. If I have a relatively low-work, almost completely non-disruptive way to definitely get everything I actually need, and a high-work, hugely-disruptive way to probably get what I actually need and also probably get a whole bunch of other useful stuff that I might be able to sell everyone else on if I also did a lot of additional work, that seems like a no-brainer to me. In fact, even if I wanted to write an amazing parser library for Python (and I kind of do, but I don't know if I have the time), I still don't think I'd want to suggest it as a replacement for the parser in CPython. Writing all the backward-compat adapters and porting the Python parser over with all its quirks intact and building the tests to prove that it's performance and error handling were strictly better and so on wouldn't be nearly as much fun as other things I could do with it.

On Sat, Jun 6, 2015 at 3:17 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
Yes, and what I was suggesting is for the lexer to return AST nodes, so it would be fine to process those nodes in the same way.
Yes, I see *how* you're trying to solve your problem, but my preference is to have one kind of hook rather than two kinds by unifying lexing and parsing. I think that's more elegant.
I suggested Earley to mitigate this fear of "exponential backtracking" since that won't happen in Earley.
If you ever decide to write that amazing parser library for Python and want any help please feel free to let me know. Best, Neil

On Jun 6, 2015, at 09:23, Neil Girdhar <mistersheik@gmail.com> wrote:
Seriously? Tokens don't form a tree, they form a list. Yes, every linked list is just a degenerate tree, so you could have every "node" just include the next one as a child. But why? Do you want to then turn the input text into a tree of character nodes? Python has all kinds of really good tools for dealing with iterables; why take away those tools and force me to work with a more complicated abstraction that Python doesn't have any tools for dealing with?

In the case of the user-defined literal hack, for example, I can use the adjacent-pairs recipe from itertools and my transformation becomes trivial. I did it more explicitly in the hack I uploaded, using a generator function with a for statement, just to make it blindingly obvious what's happening. But if I had to deal with a tree, I'd either have to write explicit lookahead or store some state explicitly on the tree or the visitor. That isn't exactly _hard_, but it's certainly _harder_, and for no benefit.

Also, if we got my change, I could write code that cleanly hooks parsing in 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people can at least use it, and all of the relevant and complicated code would be shared between the two versions. With your change, I'd have to write code that was completely different for 3.6+ than what I could backport, meaning I'd have to write, debug, and maintain two completely different implementations. And again, for no benefit.

And finally, once again: we already have a token stream as part of the process, we already expose every other interesting step in the process, exposing the token stream as it exists today in a way that fits into everything else as it exists today is clearly the easiest and least disruptive thing to do. Sometimes it's worth doing harder or more disruptive things because they provide more benefit, but you haven't yet shown any benefit.

You asked me for examples, and I provided them. Why don't you try writing a couple of actual examples--user literals, the LINQ-ish example from MacroPy, whatever--using your proposed design to show us how they could be simpler, or more elegant, or open up further possibilities. Or come up with an example of something your design could do that the existing one (even with my small proposed change) can't.
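For reference, the adjacent-pairs idea looks something like this (a detection-only sketch written against the pure-Python tokenize module of the 3.4/3.5 era under discussion; find_suffixed_numbers and the suffix convention are invented here, not the actual hack):

    import io
    import tokenize
    from itertools import tee

    def pairwise(iterable):
        # The classic itertools recipe: s -> (s0, s1), (s1, s2), ...
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

    def find_suffixed_numbers(source):
        # Yield (NUMBER, NAME) token pairs that touch with no whitespace in
        # between, e.g. the "1.2" and "d" in "spam = 1.2d + 3".
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        for tok, nxt in pairwise(tokens):
            if (tok.type == tokenize.NUMBER and nxt.type == tokenize.NAME
                    and tok.end == nxt.start):
                yield tok, nxt

    for num, name in find_suffixed_numbers("spam = 1.2d + 3\n"):
        print(num.string, name.string)   # -> 1.2 d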
I'm trying to find a way to interpret this that makes sense. I think you're suggesting that we should throw out the idea of letting users write and install simple post-processing hooks in Python, because that will force us to find a way to instead make the entire parsing process user-customizable at runtime, which will force users to come up with "more elegant" solutions involving changing the grammar instead of post-processing it macro-style. If so, I think that's a very bad idea. Decades of practice in both Python and many other languages (especially those with built-in macro facilities) shows that post-processing at the relevant level is generally simple and elegant. Even if we had a fully-runtime-customizable parser, something like OMeta but "closing the loop" and implementing the language in the programmable metalanguage, many things are still simpler and more elegant written post-processing style (as used by existing import hooks, including MacroPy, and in other languages going all the way back to Lisp), and there's a much lower barrier to learning them, and there's much less risk of breaking the compiler/interpreter being used to run your hook in the first place. And, even if none of that were true, and your new and improved system really were simpler in every case, and you had actually built it rather than just envisioning it, there's still backward compatibility to think of. Do you really want to break working, documented functionality that people have written things like MacroPy on top of, even if forcing them to redesign and rewrite everything from scratch would force them to come up with a "more elegant" solution? And finally, the added flexibility of such a system is a cost as well as a benefit--the fact that Arc makes it as easy as possible to "rewrite the language into one that makes writing your application trivial" also means that one Arc programmer can't understand another's code until putting in a lot of effort to learn his idiosyncratic language.
I already explained that using standard PEG with a packrat parser instead of extended PEG with an OMeta-style parser gives you linear time. Why do you think telling me about a decades-older cubic-time algorithm designed for parsing natural languages that's a direct ancestor to two other algorithms I also already mentioned is going to be helpful? Do you not understand the advantages of PEG or GLR over Earley?

On 7 June 2015 at 08:52, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
I don't think I've said this explicitly yet, but I'm +1 on the idea of making it easier to "hack the token stream". As Andrew has noted, there are two reasons this is an interesting level to work at for certain kinds of modifications:

1. The standard Python tokeniser has already taken care of converting the byte stream into Unicode code points, and the code point stream into tokens (including replacing leading whitespace with the structural INDENT/DEDENT tokens)
2. You get to work with a linear stream of tokens, rather than a precomposed tree of AST nodes that you have to traverse and keep consistent

If all you're wanting to do is token rewriting, or to push the token stream over a network connection in preference to pushing raw source code or fully compiled bytecode, a bit of refactoring of the existing tokeniser/compiler interface to be less file based and more iterable based could make that easier to work with.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
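As a rough sketch of the kind of thin, iterable-based convenience layer being described (text_to_tokens and tokens_to_text are made-up names wrapping what tokenize already provides):

    import io
    import tokenize

    def text_to_tokens(text):
        # str -> iterable of tokens, hiding the readline dance.
        return tokenize.generate_tokens(io.StringIO(text).readline)

    def tokens_to_text(tokens):
        # iterable of tokens -> str, ready to hand to compile() or ast.parse().
        return tokenize.untokenize(tokens)

    print(tokens_to_text(text_to_tokens("x = 1\n")))   # round-trips to "x = 1\n"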

On Jun 7, 2015, at 03:30, random832@fastmail.us wrote:
I'm pretty sure that just doing nothing special here means you get a SyntaxError from the parser. Although I probably need more test cases.

Anyway, this is one of those cases I mentioned where the SyntaxError can't actually show you what's wrong with the code, because the actual source doesn't have an error in it, only the transformed token stream. But there are easier ways to get that--just replace a `None` with a `with` in the token stream and you get an error that shows you a perfectly valid line, with no indication that a hook has screwed things up for you.

I think we can at least detect that the tokens don't match the source line and throw in a note to go look for an installed token-transforming hook. It would be even nicer if we could show what the untokenized line looks like, so the user can see why it's an error. Something like this:

    File "<input>", line 1
      if spam is None:
                 ^
    SyntaxError: invalid syntax
    Tokens do not match input, parsed as
      if spam is with :

Of course in the specific case you mentioned of unbalanced parens swallowing a dedent, the output still wouldn't be useful, but I'm not sure what we could show usefully in that case anyway.

On Sun, Jun 7, 2015, at 09:24, Andrew Barnert wrote:
I'm pretty sure that just doing nothing special here means you get a SyntaxError from the parser. Although I probably need more test cases.
I'm actually talking about what happens if the _untransformed_ stream contains an unbalanced bracket that the hook is supposed to eliminate (and/or supply the balancing one). My mental model of this idea was that the "lexer" generates the entire untransformed (but including indent/dedent magic etc) token sequence, then supplies it to the hook.

random832@fastmail.us wrote:
I'm actually talking about what happens if the _untransformed_ stream contains an unbalanced bracket that the hook is supposed to eliminate
I'm of the opinion that designing an input language to require or allow unmatched brackets is a bad idea. If nothing else, it causes grief for editors that do bracket matching. -- Greg

On Jun 6, 2015, at 22:59, Nick Coghlan <ncoghlan@gmail.com> wrote:
Actually, as I discovered while trying to hack in the change this afternoon, the C tokenizer doesn't actually take care of converting the byte stream to Unicode. It does take care of detecting the encoding, but what it hands to the parsetok function is still encoded bytes. The Python wrapper does transparently decode for you (in 3.x), but that actually just makes it harder to feed the output back into the parser, because the parser wants encoded bytes. (Also, as I mentioned before, it would be nice if the Python wrapper could just take Unicode in the first place, because the most obvious place to use this is in an import hook, where you can detect and decode the bytes yourself in a single line, and it's easier to just use the string than to encode it to UTF-8 so the tokenizer can detect UTF-8 so either the Python tokenizer wrapper or the C parser can decode it again...).

Anyway, this part was at least easy to temporarily work around; the stumbling block that prevented me from finishing a working implementation this afternoon is a bit hairier. The C tokenizer hands the parser the current line (which can actually be multiple lines) and start and end pointers to characters within that line. It also hands it the current token string, but the parser ignores that and just reads from line+start to line+end. The Python tokenizer, on the other hand, gives you line number and (Unicode-based) column numbers for start and end. Converting those to encoded-bytes offsets isn't _that_ hard... but those are offsets into the original (encoded) line, so the parser is going to see the value of the original token rather than the token value(s) you're trying to substitute, which defeats the entire purpose. I was able to implement a hacky workaround using untokenize to fake the current line and provide offsets within that, but that means you get garbage from SyntaxErrors, and all your column numbers--and, worse, all your line numbers, if you add in a multi-line token--are off within the AST and bytecode. (And there may be other problems; those are just the ones I saw immediately when I tried it...)

I think what I'm going to try next is to fork the whole parsetok function and write a version that uses the token's string instead of the substring of the line, and start and stop as offsets instead of pointers. I'm still not sure whether the token string and line should be in tok->encoding, UTF-8, UTF-32, or a PyUnicode object, but I'll figure that out as I do it.... Once I get that working for the wrapped-up token iterator, then I can see if I can reunify it with the existing version for the C tokenizer (without any performance penalty, and without breaking pgen). I'd hate to have two copies of that giant function to keep in sync.

Meanwhile, I'm not sure what to do about tokens that don't have the optional start/stop/line values. Maybe just not allow them (just because untokenize can handle it doesn't mean ast.parse has to), or maybe just untokenize a fake line (and if any SyntaxErrors are ugly and undebuggable, well, don't skip those values...). The latter might be useful if for some reason you wanted to generate tokens on the fly instead of just munging a stream of tokens from source text you have available. I'm also not sure what to do about a few error cases.
For example, if you feed the parser something that isn't iterable, or whose values aren't an iterable of iterables of length 2 to 5 with the right types, that really feels more like a TypeError than a SyntaxError (and that would also be a good way to signal the end user that the bug is in the token stream transformer rather than in the source code...), but raising a TypeError from within the parser requires a bit more refactoring (the tokenizer can't tell you what error to raise, just that the current token is an error along with a tokenizer error code--although I could add an E_NOTOK error code that the parser interprets as "raise a TypeError instead of a SyntaxError"), and I'm not sure whether that would affect any other stuff. Anyway, for the first pass I may just leave it as a SyntaxError, just to get something working. Finally, it might be nice if it were possible to generate a SyntaxError that showed the original source line but also told you that the tokens don't match the source (again, to signal the end user that he should look at what the hook did to his code, not just his code), but I'm not sure how necessary that is, or how easy it will be (it depends on how I end up refactoring parsetok).
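On the column-offset conversion mentioned above, here's a minimal sketch of the easy part -- mapping the tokenize module's Unicode column numbers to byte offsets within the encoded line, assuming you have the decoded line and know its encoding (the hard part, as described, is everything else):

def byte_offset(line, col, encoding="utf-8"):
    # Byte offset, within the encoded line, of Unicode column `col`.
    return len(line[:col].encode(encoding))

# byte_offset("x = 'héllo'", 7) == 8 with UTF-8: seven code points, eight bytes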
I didn't think about that use case at all, but that could be very handy.

On Sun, Jun 7, 2015 at 1:59 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I will explain in another message how to replace the indent and dedent tokens so that the lexer loses most of its "magic" and becomes just like the parser.
The AST nodes would contain within them the linear stream of tokens that you are free to work with. The AST also encodes the structure of the tokens, which can be very useful if only to debug how the tokens are being parsed. You might find yourself, when doing a more complicated lexical transformation, trying to reverse engineer where the parse tree nodes begin and end in the token stream. This would be a nightmare. This is the main problem with trying to process the token stream "blind" to the parse tree.
You can still do all that with the tokens included in the parse tree.

On 8 June 2015 at 12:23, Neil Girdhar <mistersheik@gmail.com> wrote:
I don't dispute that this *can* be done, but what would it let me do that I can't already do today? In addition, how will I be able to continue to do all the things that I can do today with the separate tokenisation step? *Adding* steps to the compilation toolchain is doable (one of the first things I was involved in as a CPython core developer was the introduction of the AST based parser in Python 2.5), but taking them *away* is much harder.

You appear to have an idealised version of what a code generation toolchain "should" be, and would like to hammer CPython's code generation pipeline specifically into that mould. That's not the way this works - we don't change the code generator for the sake of it, we change it to solve specific problems with it. Introducing the AST layer solved a problem. Introducing an AST optimisation pass would solve a problem. Making the token stream easier to manipulate would solve a problem. Merging the lexer and the parser doesn't solve any problem that we have.
Anything that cares about the structure to that degree shouldn't be manipulating the token stream - it should be working on the parse tree.
Not as easily, because I have to navigate the parse tree even when I don't care about that structure, rather than being able to just look at the tokens in isolation. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jun 7, 2015 at 10:42 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
You're right. And as usual, Nick, your analysis is spot on. My main concern is that the idealized way of parsing the language is not precluded by any change. Does adding token manipulation promise forwards compatibility? Will a Python 3.9 have to have the same kind of token manipulator exposed? If not, then I'm +1 on token manipulation. :)
I don't think the burden would outweigh the bugs it would prevent: having the tree there lets you ensure that the parse tree structure is what you think it is. It's a matter of intuition, I guess.

On 8 June 2015 at 12:47, Neil Girdhar <mistersheik@gmail.com> wrote:
That may have been the heart of the confusion, as token manipulation is *already* a public feature: https://docs.python.org/3/library/tokenize.html The tokenizer module has been a public part of Python for longer than I've been a Pythonista (first documented in 1.5.2 in 1999): https://docs.python.org/release/1.5.2/lib/module-tokenize.html As a result, token stream manipulation is already possible, you just have to combine the tokens back into a byte stream before feeding them to the compiler. Any future Python interpreter would be free to fall back on implementing a token based API that way, if the CPython code generator itself were to gain a native token stream interface. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
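For concreteness, a minimal sketch of that existing round-trip workaround using only the public tokenize API (an identity transform here; a real hook would rewrite tokens, and the line/column caveat Andrew describes above still applies, since untokenize does not preserve exact positions):

import io
import tokenize

def transform(tokens):
    # Identity transform; a real hook would inspect and rewrite tokens here.
    yield from tokens

source = "x = 1 + 2\n"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
new_source = tokenize.untokenize(transform(tokens))
exec(compile(new_source, "<tokens>", "exec"))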

On Sat, Jun 6, 2015 at 6:52 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
When the work done by the lexer is done by the parser, the characters contained in a lexical node will be siblings. Typically, in Python a tree is represented as nodes with iterable children, so the characters would just be a string.
The stream would still be just an iterable.
You would still be able to do all this stuff.
If I find time, I'll do that. I will explain my solution in another message.
I understand that you are motivated by a specific problem. However, your solution does not solve the general problem. If you only allow transformations of the token stream, the token set is fixed. Transformations of the token stream also hide — even in your example — the fact that you're actually building what is conceptually a subtree. It makes more sense to me to solve the problem in general, once and for all. (1) Make it easy to change the grammar, and (2) make lexing part of the grammar. Now, you don't have to change the grammar to solve some problems. Sometimes, you can just use AST transformers to accomplish what you were doing with a lexical transformer. That's nice because it's one less thing to learn. Sometimes, you need the power of changing the grammar. That is already coming with the popularity of languages like Theano. I really want to transform Python code into Theano, and for that it may be more elegant to change the grammar.

On 6 June 2015 at 15:30, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
The repetition is really a different issue. A different implementation of the same basic design Python already has could make it so you only have to write explicit code for the 3 places the CST->AST node doesn't follow the same rules as everywhere else and the dozen or so places where the AST has to be post-validated, instead of having to write explicit code for both sides of every single node type. And that kind of cleanup could be done without breaking backward compatibility, because the interfaces on each side of the code would be unchanged. But that's also a lot less fun of a change than writing a whole new parser, so I wouldn't be surprised if nobody ever did it…
Eugene Toder had a decent go at introducing more autogeneration into the code generation code a few years ago as part of building out an AST level optimiser: http://bugs.python.org/issue11549 The basic concepts Eugene introduced still seem sound to me, there'd just be some work in bringing the patches up to date to target 3.6. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Back in the day, I remember Lex and Yacc, then came Flex and Bison, and then ANTLR, which unified lexing and parsing under one common language. In general, I like the idea of putting everything together. I think that because of Python's separation of lexing and parsing, it accepts weird text like "(1if 0else 2)", which is crazy.

Here's what I think I want in a parser: Along with the grammar, you also give it code that it can execute as it matches each symbol in a rule. In Python for example, as it matches each argument passed to a function, it would keep track of the count of *args, **kwargs, and keyword arguments, and regular arguments, and then raise a syntax error if it encounters anything out of order. Right now that check is done in validate.c, which is really annoying. I want to specify the lexical rules in the same way that I specify the parsing rules. And I think (after Andrew elucidates what he means by hooks) I want the parsing hooks to be the same thing as lexing hooks, and I agree with him that hooking into the lexer is useful. I want the parser module to be automatically-generated from the grammar if that's possible (I think it is). Typically each grammar rule is implemented using a class. I want the code generation to be a method on that class. This makes changing the AST easy. For example, it was suggested that we might change the grammar to include a starstar_expr node. This should be an easy change, but because of the way every node validates its children, which it expects to have a certain tree structure, it would be a big task with almost no payoff.

There's also a question of which parsing algorithm you use. I wish I knew more about state-of-the-art parsers. I was interested because I wanted to use Python to parse my LaTeX files. I got the impression that Earley parsers (https://en.wikipedia.org/wiki/Earley_parser) were state of the art, but I'm not sure. I'm curious what other people will contribute to this discussion, as I think having no great parsing library is a huge hole in Python. Having one would definitely allow me to write better utilities using Python. On Fri, Jun 5, 2015 at 6:55 PM, Luciano Ramalho <luciano@ramalho.org> wrote:
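A purely hypothetical sketch of that "code attached to grammar rules" idea -- none of these classes, the grammar notation, or the on_argument callback exist in any library; they just illustrate the shape of an action that runs as each argument is matched and rejects an out-of-order argument list immediately, rather than in a separate validate.c-style pass:

class ArgListRule:
    grammar = "argument (',' argument)*"    # hypothetical grammar notation

    def __init__(self):
        self.seen_keyword = False

    def on_argument(self, arg):             # hypothetical: called per matched argument
        # Enforce an ordering rule of the kind currently checked after parsing.
        if arg.kind in ('keyword', '**'):
            self.seen_keyword = True
        elif arg.kind == 'positional' and self.seen_keyword:
            raise SyntaxError("positional argument follows keyword argument")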

IMO, lexer and parser separation is sometimes great. It also makes hand-written parsers much simpler. "Modern" parsing with no lexer and EBNF can sometimes be slower than the classics, especially if one is using an ultra-fast lexer generator such as re2c. On June 5, 2015 9:21:08 PM CDT, Neil Girdhar <mistersheik@gmail.com> wrote:
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

I don't see why it makes anything simpler. Your lexing rules just live alongside your parsing rules. And I also don't see why it has to be faster to do the lexing in a separate part of the code. Wouldn't the parser generator realize that some of the rules don't use the stack and so they would end up just as fast as any lexer? On Fri, Jun 5, 2015 at 10:55 PM, Ryan Gonzalez <rymg19@gmail.com> wrote:

On Fri, Jun 5, 2015 at 7:57 PM, Neil Girdhar <mistersheik@gmail.com> wrote:
You're putting a lot of faith in "modern" parsers. I don't know if PLY qualifies as such, but it certainly is newer than Lex/Yacc, and it unifies the lexer and parser. However I don't think it would be much better for a language the size of Python. We are using PLY at Dropbox to parse a medium-sized DSL, and while at the beginning it was convenient to have the entire language definition in one place, there were a fair number of subtle bugs in the earlier stages of the project due to the mixing of lexing and parsing. In order to get this right it seems you actually have to *think* about the lexing and parsing stages differently, and combining them in one tool doesn't actually help you to think more clearly. Also, this approach doesn't really do much for the later stages -- you can easily construct a parse tree but it's a fairly direct representation of the grammar rules, and it offers no help in managing a symbol table or generating code. -- --Guido van Rossum (python.org/~guido)

On Sat, Jun 6, 2015 at 12:27 AM, Guido van Rossum <guido@python.org> wrote:
I agree with you. I think the problem might be that the parser that I'm dreaming doesn't exist for Python. In another message, I wrote what I wanted: — Along with the grammar, you also give it code that it can execute as it matches each symbol in a rule. In Python for example, as it matches each argument passed to a function, it would keep track of the count of *args, **kwargs, and keyword arguments, and regular arguments, and then raise a syntax error if it encounters anything out of order. Right now that check is done in validate.c, which is really annoying. I want to specify the lexical rules in the same way that I specify the parsing rules. And I think (after Andrew elucidates what he means by hooks) I want the parsing hooks to be the same thing as lexing hooks, and I agree with him that hooking into the lexer is useful. I want the parser module to be automatically-generated from the grammar if that's possible (I think it is). Typically each grammar rule is implemented using a class. I want the code generation to be a method on that class. This makes changing the AST easy. For example, it was suggested that we might change the grammar to include a starstar_expr node. This should be an easy change, but because of the way every node validates its children, which it expects to have a certain tree structure, it would be a big task with almost no payoff. — I don't think this is possible with Ply.
That's interesting. I can understand wanting to separate them mentally, but two problems with separating at a fundamental programmatic level are: (1) you may want to change a lexical token like number to — in some cases — be LL(1) for who knows what reason; or (2) you would have to implement lexical hooks differently than parsing hooks. In some of Andrew's code below, the tokenize hook looks so different from the parser hook, and I think that's unfortunate.
It would be nice to generate the code in methods on the classes that implement the grammar rules. This would allow you to use memos that were filled in as you were parsing and validating to generate code.
-- --Guido van Rossum (python.org/~guido)

On Fri, Jun 5, 2015 at 9:27 PM, Guido van Rossum <guido@python.org> wrote:
PLY doesn't really "unify" the lexer and parser; it just provides both of them in the same Python package (and uses somewhat similar syntax and conventions for each). I wrote a project at my last consulting position to process a fairly complex DSL (used for code generation to several targets, Python, C++, Verilog, etc.). I like PLY, and decided to use that tool; but after a short while I gave up on the parser part of it, and only used the lexing, leaving parsing to "hand rolled" code. I'm sure I *could* have managed to shoehorn in the entire EBNF stuff into the parsing component of PLY. But for my own purpose, I found it more important to do various simplifications and modifications of the token stream before generating the data structures that defined the eventual output parameters. So in this respect, what I did is something like a simpler version of Python's compilation pipeline. Actually, what I did was probably terrible practice for parsing purists, but felt to me like the best "practicality beats purity" approach. There were these finite number of constructs in the DSL, and I would simply scan through the token stream, in several passes, trying to identify a particular construct, then pulling it out into the relevant data structure type, and just marking those tokens as "used". Other passes would look for other constructs, and in some cases I'd need to resolve a reference to one kind of construct that wasn't generated until a later pass in a "unification" step. There was a bit of duct tape and bailing wire involved in all of this, but it actually seemed to keep the code as simple as possible by isolating the code to generate each type of construct. None of which is actually relevant to what Python should do in its parsing, just a little bit of rambling thoughts. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Fri, Jun 5, 2015, at 22:21, Neil Girdhar wrote:
I don't think this really has anything to do with separation of lexing and parsing. C rejects this (where "this" is "integer followed by arbitrary alphabetic token") purely due to the lexing stage (specifically, 1if or 0else would be a single "preprocessor number" token, with no valid meaning. Of course, this has its own quirks, for example 0xE+1 is invalid in C.)
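As a comparison point on the Python side, the stdlib tokenizer really does split that text at the lexical level (a quick check; whether the compiler then accepts the result varies by Python version, so treat the parse step as a separate question):

import io
import tokenize

for tok in tokenize.generate_tokens(io.StringIO("(1if 0else 2)\n").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# OP '(', NUMBER '1', NAME 'if', NUMBER '0', NAME 'else', NUMBER '2', OP ')', ...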

On 6 June 2015 at 12:21, Neil Girdhar <mistersheik@gmail.com> wrote:
The design of *Python's* grammar is deliberately restricted to being parsable with an LL(1) parser. There are a great many static analysis and syntax highlighting tools that are able to take advantage of that simplicity because they only care about the syntax, not the full semantics. Anyone actually doing their *own* parsing of something else *in* Python would be better advised to reach for PLY (https://pypi.python.org/pypi/ply). PLY is the parser underlying https://pypi.python.org/pypi/pycparser, and hence the highly regarded CFFI library, https://pypi.python.org/pypi/cffi. Other notable parsing alternatives folks may want to look at include https://pypi.python.org/pypi/lrparsing and http://pythonhosted.org/pyparsing/ (both of which allow you to use Python code to define your grammar, rather than having to learn a formal grammar notation). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
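For anyone who hasn't used it, here's roughly what a minimal PLY lexer looks like -- a sketch following PLY's documented conventions; PLY is a third-party package, so this assumes it is installed:

import ply.lex as lex

tokens = ('NUMBER', 'PLUS', 'MINUS')

t_PLUS = r'\+'
t_MINUS = r'-'
t_ignore = ' \t'            # characters to skip between tokens

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("3 + 4 - 5")
for tok in lexer:
    print(tok.type, tok.value)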

On 6 June 2015 at 15:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
For the later stages of the pipeline (i.e. AST -> code generation), CPython now uses Eli Bendersky's asdl_parser: https://github.com/eliben/asdl_parser More background on that: http://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compi... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Jun 6, 2015 at 1:00 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Given the validation that happens, it's not actually LL(1) though. It's mostly LL(1) with some syntax errors that are raised for various illegal constructs. Anyway, no one is suggesting changing the grammar.
I looked at ply and pyparsing, but it was impossible to simply parse LaTeX because I couldn't explain to them how to suck up the right number of arguments given the name of the function. When it sees a function, it learns how many arguments that function needs. When it sees a function call \a{1}{2}{3}, if "\a" takes 2 arguments, then it should only suck up 1 and 2 as arguments, and leave 3 as a regular text token. In other words, I should be able to tell the parser what to expect in code that lives on the rule edges.

The parsing tools you listed work really well until you need to do something like (1) the validation step that happens in Python, or (2) figuring out exactly where the syntax error is (line and column number) or (3) ensuring that whitespace separates some tokens even when it's not required to disambiguate different parse trees. I got the impression that they wanted to make these languages simple for the simple cases, but they were made too simple and don't allow you to do everything in one simple pass. Best, Neil

On June 6, 2015 12:29:21 AM CDT, Neil Girdhar <mistersheik@gmail.com> wrote:
Can't you just hack it into the lexer? When the slash is detected, the lexer can treat the following identifier as a function, look up the number of required arguments, and push it onto some sort of stack. Whenever a left bracket is encountered and another argument is needed by the TOS, it returns a special argument opener token.
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
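A rough sketch of Ryan's suggestion applied to Neil's \a{1}{2}{3} example -- my own illustration, not any existing library's API: keep an arity table, and when a command is scanned, consume exactly that many braced groups, leaving the rest as ordinary text tokens:

import re

ARITY = {'a': 2}     # hypothetical: \a takes two arguments

def scan(text):
    # Very naive LaTeX-ish scanner: commands, braced groups, and plain text.
    return re.findall(r'\\[A-Za-z]+|\{[^{}]*\}|[^\\{}]+', text)

def parse(text):
    out, tokens, i = [], scan(text), 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith('\\'):
            n = ARITY.get(tok[1:], 0)
            out.append(('command', tok, tokens[i + 1:i + 1 + n]))
            i += 1 + n
        else:
            out.append(('text', tok))
            i += 1
    return out

print(parse(r'\a{1}{2}{3}'))
# [('command', '\\a', ['{1}', '{2}']), ('text', '{3}')]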

On 6 June 2015 at 17:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
Let me just pimp https://pypi.python.org/pypi/Parsley here - I have written languages in both Parsley (a simple packaging metadata language) and its predecessor pymeta (in which I wrote pybars - handlebars.js for python) - and both were good implementations of OMeta, IMNSHO. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Converged Cloud

The best parsing library in Python I could find to my eyes is modgrammar: https://pythonhosted.org/modgrammar/ It's GLR I think. The documentation isn't bad and the syntax isn't too bad. The major change that I want to make to it is to replace the grammar class variables with regular instance generator methods, and to replace the components of the grammar return value, which are currently classes, with constructed objects. That way, a whitespace object that represents a block continuation can be constructed to know how much whitespace it must match. Similarly, a "suite" can include a constructed whitespace object that includes extra space. After it's matched, it can be queried for its size, and the grammar generator method can construct whitespace objects with the appropriate size. This eliminates the need for INDENT and DEDENT tokens. This kind of dynamic grammar generation is desirable for all kinds of other language-related problems, like the LaTeX one I discussed, and it also allows us to merge all of the validation code into the parsing code, which follows "Don't Repeat Yourself". I think it's a better design. I will try to find time to build a demo of this this week.

Ultimately, my problem with "token transformers" is, if I'm understanding correctly, that we want to change Python so that not only will 3.5 have token transformers, but every Python after that has to support this. This risks constraining the development of the elegant solution. And for what major reason do we even need token transformers so soon? For a toy example on python-ideas about automatic Decimal instances? Why can't a user define a one-character function "d(x)" to do the conversion everywhere? I prefer to push for the better design even if it means waiting a year. Best, Neil On Sun, Jun 7, 2015 at 6:19 PM, Robert Collins <robertc@robertcollins.net> wrote:

On 8 June 2015 at 12:37, Neil Girdhar <mistersheik@gmail.com> wrote:
Neil, you're the only one proposing major structural changes to the code generation pipeline, and nobody at all is proposing anything for Python 3.5 (the feature freeze deadline for that has already passed). Andrew is essentially only proposing relatively minor tweaks to the API of the existing tokenizer module to make it more iterable based and less file based (while still preserving the file based APIs). Eugene Toder's and Dave Malcolm's patches from a few years ago make the existing AST -> bytecode section of the toolchain easier to modify and experiment with (and are ideas worth exploring for 3.6 if anyone is willing and able to invest the time to bring them back up to date). However, if you're specifically wanting to work on an "ideal parser API", then the reference interpreter for a 24 year old established language *isn't* the place to do it - the compromises necessitated by the need to align with an extensive existing ecosystem will actively work against your goal for a clean, minimalist structure. That's thus something better developed *independently* of CPython, and then potentially considered at some point in the future when it's better established what concrete benefits it would offer over the status quo for both the core development team and Python's end users. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jun 7, 2015 at 10:52 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
That's not what I'm doing. All I'm suggesting is that changes to Python that *preclude* the "ideal parser API" be avoided. I'm not trying to make the ideal API happen today. I'm just keeping the path to that rosy future free of obstacles.

On 8 June 2015 at 14:56, Neil Girdhar <mistersheik@gmail.com> wrote:
On Sun, Jun 7, 2015 at 10:52 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I've used that approach in projects before, and in hindsight I realise that I caused significant disruption doing that. The reason boils down to - without consensus that the rosy future is all of:

- the right future
- worth doing eventually
- more important to reach than solving problems that appear on the way

then you end up frustrating folk that have problems now, without actually adding value to anyone: the project gets to choose between a future that [worst case, fails all three tests] might not be right, might not be worth doing, and is less important than actual problems which it is stopping solutions for. In this particular case, given Nick's comments about why we change the guts here, I'd say that 'worth doing eventually' is not in consensus, and I personally think that anything that is 'in the indefinite future' is almost never more important than problems affecting people today, because it's a possible future benefit vs a current cost. There's probably an economics theorem to describe that, but I'm not an economist :)

Pragmatically, I propose that the existing structure already has significant friction around any move to a unified (but still multi-pass I presume) parser infrastructure, and so adding a small amount of friction for substantial benefits will not substantially impact the future work. Concretely: a multi-stage parser with a unified language for both lexer and parser should be quite amenable to calling out to a legacy token hook, without onerous impact. Failing that, we can follow the deprecation approach when someone finds we can't do that, and after a reasonable time remove the old hook. But right now, I think the onus is on you to show that a shim wouldn't be possible, rather than refusing to support adding a tokeniser hook because a shim isn't *obviously possible*. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Converged Cloud

On 8 June 2015 at 13:10, Robert Collins <robertc@robertcollins.net> wrote:
I don't know about economics, but for anyone that hasn't encountered it before, the phrase YAGNI is a good one to know: You Ain't Gonna Need It. ( http://c2.com/cgi/wiki?YouArentGonnaNeedIt ) The way YAGNI applies when deciding *to* do something is when you're faced with the following choice:

* Making a particular change solves an immediate problem, but would make another possible change more complex in the future
* Not making a change preserves the simplicity of the possible future change, but also doesn't solve the immediate problem

Sometimes you'll get lucky and someone will figure out a third path that both addresses the immediate concern *and* leaves your future options open for other changes. More often though, you'll have to decide between these options, and in those cases "YAGNI" argues in favour of heavily discounting the potential increase in difficulty for a change you may never make anyway. Cheers, Nick.

P.S. This tension between considering the long term implications of changes without allowing that consideration to block near term progress is what I personally see in the following two lines of the Zen of Python:

    Now is better than never.
    Although never is often better than *right* now.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Yes, but in this case the near term "problem" was as far as I can tell just parsing floats as decimals, which is easily done with a somewhat noisy function call. I don't see why it's important. The way that CPython does parsing is more than just annoying. It's a mess of repetition and tests that try to make sure that all of the phases are synchronized. I don't think that CPython is the future of Python. One day, someone will write a Python interpreter in Python that includes a clean one-pass parser. I would prefer to make that as easy to realize as possible. You might think it's far-fetched. I don't think it is. Best, Neil On Mon, Jun 8, 2015 at 12:18 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

On Jun 7, 2015, at 21:23, Neil Girdhar <mistersheik@gmail.com> wrote:
Yes, but in this case the near term "problem" was as far as I can tell just parsing floats as decimals, which is easily done with a somewhat noisy function call. I don't see why it's important.
This isn't the only case anyone's ever wanted. The tokenize module has been there since at least 1.5, and presumably it wasn't added for no good reason, or made to work with 3.x just for fun. And it has an example use in the docs. The only thing that's changed is that, now that postprocessing the AST has become a lot easier and less hacky because of the ast module and the succession of changes to the import process, the fact that tokenize is still clumsy and hacky is more noticeable.

On 8 June 2015 at 14:23, Neil Girdhar <mistersheik@gmail.com> wrote:
No, the problem to be solved is making it easier for people to "play" with Python's syntax and try out different ideas in a format that can be shared easily. The more people that are able to tinker and play with something, and share the results of their work, the more opportunities there are for good ideas to be had, and shared, eventually building up to the point of becoming a coherent proposal for change. The 3.4 dis module included several enhancements to make playing with bytecode easier and more fun: https://docs.python.org/3/whatsnew/3.4.html#dis 3.4 also added the source_to_code() hook in importlib to make it easy to tweak the compilation pass without having to learn all the other intricacies of the import system: https://docs.python.org/3/whatsnew/3.4.html#importlib MacroPy and Hylang are interesting examples of ways to manipulate the AST in order to use the CPython VM without relying solely on the native language syntax, while byteplay and Numba are examples of manipulating things at the bytecode level.
While the structure of CPython's code generation toolchain certainly poses high incidental barriers to entry, those barriers are trivial compared to the *inherent* barriers to entry involved in successfully making the case for a change like introducing a matrix multiplication operator or more clearly separating coroutines from generators through the async/await keywords (both matrix multiplication support and async/await landed for 3.5). If someone successfully makes the case for a compelling change to the language specification, then existing core developers are also ready, willing and able to assist in actually making the change to CPython. As a result, making that final step of *implementing* a syntactic change in CPython easier involves changing something that *isn't the bottleneck in the process*, so it would have no meaningful impact on the broader Python community. By contrast, making *more steps* of the existing compilation process easier for pure Python programmers to play with, preferably in an implementation independent way, *does* impact two of the bottlenecks: the implementation of syntactic ideas in executable form, and sharing those experiments with others. Combining that level of syntactic play with PyPy's ability to automatically generate JIT compilers offers an extraordinarily powerful platform for experimentation, with the standardisation process ensuring that accepted experiments also scale down to significantly more constrained environments like MicroPython. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Robert Collins writes:
There's probably an economics theorem to describe that, but I'm not an economist :)
I'd like to refuse the troll, but this one is too good to pass up. The book which is the authoritative source on the theorems you're looking for is Nancy Stokey's "The Economics of Inaction". 'Nuff said on this topic?<wink/>

On Jun 7, 2015, at 19:52, Nick Coghlan <ncoghlan@gmail.com> wrote:
And also a patch to the existing ast module to allow it to handle tokenizers from Python as well as from C. The tokenizer tweaks themselves are just to make that easier (and to make using tokenizer a little simpler even if you don't feed it directly to the parser). (It surprised me that the C-level tokenizer actually can take C strings and string objects rather than file objects, but once you think about how the high-level C API stuff like being able to exec a single line must work, it's pretty obvious why that was added...)
I got a chance to take a look at this, and, while it seems completely orthogonal to what I'm trying to do, it also seems very cool. If someone got the patches up to date for the trunk and fixed the minor issues involved in the last review (both of which look pretty simple), what are the chances of getting it reviewed for 3.6? (I realize this is probably a better question for the issue tracker or the -dev list than buried in the middle of a barely-relevant -ideas thread, but I'm on my phone here, and you brought it up.:)

On 8 June 2015 at 13:18, Andrew Barnert <abarnert@yahoo.com> wrote:
I'm still interested in the underlying ideas, and while it's always possible to get completely surprised by life's twists and turns, I'm optimistic I'd be able to find the time to provide feedback on it myself for 3.6, and hopefully encourage other folks with experience with the compiler internals to review it as well. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Neil Girdhar <mistersheik@gmail.com> wrote:
> Along with the grammar, you also give it code that it can execute as it matches each symbol in a rule. In Python for example, as it matches each argument passed to a function, it would keep track of the count of *args, **kwargs, and keyword arguments, and regular arguments, and then raise a syntax error if it encounters anything out of order. Right now that check is done in validate.c, which is really annoying.

Agreed. For 3.4 it was possible to encode these particular semantics into the grammar itself, but it would no longer be LL(1). If I understood correctly, you wanted to handle lexing and parsing together. How would the INDENT/DEDENT tokens be generated? For my private ast generator, I did the opposite: I wanted to formalize the token preprocessing step, so I have:

    lexer -> parser1 (generates INDENT/DEDENT) -> parser2 (generates the ast directly)

It isn't slower than what is in Python right now and you can hook into the token stream at any place. Stefan Krah
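For anyone wondering what such a "parser1" stage boils down to, here is a minimal sketch of the classic indentation-stack algorithm (my own illustration, not Stefan's code): compare each line's leading whitespace against a stack of open indentation levels and emit INDENT/DEDENT pseudo-tokens accordingly (error handling for inconsistent dedents omitted).

def indent_tokens(lines):
    stack = [0]
    for line in lines:
        if not line.strip():
            continue                    # blank lines don't affect indentation
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)
            yield ('INDENT', line)
        while width < stack[-1]:        # unwinding to a shallower level
            stack.pop()
            yield ('DEDENT', line)
        yield ('LINE', line.strip())
    while len(stack) > 1:               # close anything still open at EOF
        stack.pop()
        yield ('DEDENT', '')

# list(indent_tokens(["if x:", "    y = 1", "z = 2"]))
# [('LINE', 'if x:'), ('INDENT', '    y = 1'), ('LINE', 'y = 1'),
#  ('DEDENT', 'z = 2'), ('LINE', 'z = 2')]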

Maybe if every production has a link to its parent, then the spaces after a newline followed by statement reduce to indentation followed by statement, which reduces to indent or dedent or nothing followed by statement based on the parent's indentation level? In other words the parent (a file_input e.g.) has active control of the grammar of its children? On Sat, Jun 6, 2015 at 9:36 AM, s.krah <stefan@bytereef.org> wrote:

Ryan: I'm trying to figure out how the parsing library should be done — not trying to work around other designs. Stefan: maybe this is a better answer to your question. So thinking about this more, this is how I think it should be done: Each grammar rule is expressed as an Iterable.

class FileInput:
    def __init__(self):
        self.indent_level = None

    def match(self):
        while True:
            matched = yield Disjunction(
                '\n',
                [Whitespace(self.indent_level, indent=False), Statement()])
            if matched == '\n':
                break
        yield EndOfFile()


class Suite:
    def __init__(self, indent_level):
        self.indent_level = indent_level

    def match(self):
        yield Disjunction(
            SimpleStatement(),
            ['\n', Whitespace(self.indent_level, indent=True), Repeat(Statement)])
        # dedent is not required because the next statement knows its indent
        # level.

On Sat, Jun 6, 2015 at 9:36 AM, s.krah <stefan@bytereef.org> wrote:

On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik@gmail.com> wrote:
Actually CPython has another step between the AST and the bytecode, which validates the AST to block out trees that violate various rules that were not easily incorporated into the LL(1) grammar.
Yes, and it also builds CST nodes before building the AST, and there's a step after AST validation and before bytecode generation where the symbol table for the scope is built and LEG rules are applied. But none of those are things that seem particularly useful to hook. Maybe that's just a failure of imagination, but I've never wanted to do it.

Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook because an import hook can't get access to the token stream. And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; it would be a whole lot harder without that, while it would be a little easier and a whole lot cleaner if I could just hook the token stream.
This means that when you want to change parsing, you have to change: the grammar, the AST library, the validation library, and Python's exposed parsing module.
And the code generation and other post-validation steps (unless you're just trying to create a do-nothing construct--which can be useful to give you something new to, e.g., feed to MacroPy, but it's not something you're going to check in to core, or use in your own production code). So yes, changing the grammar is painful. Which is just one more reason that being able to hack on Python without having to hack on Python is very useful. And as of 3.4, all of the pieces are there to do that, and dead-easy to use, and robust enough for production code--as long as the level you want to hack on is source text, AST, or bytecode, not token stream.
Modern parsers do not separate the grammar from tokenizing, parsing, and validation. All of these are done in one place, which not only simplifies changes to the grammar, but also protects you from possible inconsistencies. It was really hard for me when I was making changes to the parser to keep my conception of these four things synchronized.
So in my opinion, if you're going to modernize the parsing, then put it all together into one simple library that deals with all of it. It seems like what you're suggesting would add complexity, whereas a merged solution would simplify the code.
Rewriting the entire parsing mechanism from scratch might simplify things, but it also means rewriting the entire parsing mechanism from scratch. I'm sure you could implement a GLR parser generator that takes a complete declarative grammar and generates something that goes right from source code to a SAX- or iterparse-style pre-validated AST, and that would be a really cool thing. But besides being a lot of work, it would also be a huge amount of risk. You'd almost certainly end up with new bugs, new places where syntax errors are harder to diagnose, new places where compiling is slower than it used to be, etc.

Also, Python is defined as having a separate lexical analysis phase, and a module named tokenize that does the same thing as this phase, and tests that test it, and so on. So, you can't just throw all that out and remain backward compatible.

Meanwhile, adding functions to create a token state struct out of a Python iterable, drive it, and expose that functionality to Python is a lot less work, very unlikely to have any effect on the existing default mechanism (if you don't hook the token stream, the existing code runs the same as today, except for an if check inside the next-token function), and much easier to make exactly compatible with existing behavior even when you do hook the token stream (if creating and driving the token state works, all the other code is the same as it ever was). And it has no backward compat implications.
If it's hard to write a fast parser, then consider writing a parser generator in Python that generates the C code you want.
It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so one that does the exact same thing as whatever version of Python you're running under's parser).

On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
Yes, I think I understand your motivation. Can you help me understand what the hook you would write would look like?
Yes.
I don't see why that is. The lexical "phase" would just become new parsing rules, and so it would be supplanted by the parser. Then you wouldn't need to add special hooks for lexing. You would merely have hooks for parsing.
I think it's worth exploring having the whole parser in one place, rather than repeating the same structures in at least four places. With every change to the Python grammar, you pay for this forced repetition.

First, I think your idea is almost completely tangential to mine. Yes, if you completely replaced both the interface and the implementation of the parser, you could do just about anything you wanted. But assuming nobody is going to completely replace the way Python does parsing today, I think it's still useful to add the one missing useful hook to the existing system. But let's continue. On Friday, June 5, 2015 7:08 PM, Neil Girdhar <mistersheik@gmail.com> wrote: On Fri, Jun 5, 2015 at 6:58 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
On Jun 5, 2015, at 13:38, Neil Girdhar <mistersheik@gmail.com> wrote:
Hooking the token stream, on the other hand, has pretty obvious uses. For example, in the user-defined literal thread, Paul Moore suggested that for Nick Coghlan's "compile-time expression" idea, requiring valid Python syntax would be way too restrictive, but requiring valid Python tokens is probably OK, and it automatically solves the quoting problem, and it would usually be easier than parsing text. I think he's right about all three parts of that, but unfortunately you can't implement it that way in an import hook because an import hook can't get access to the token stream.
And of course my hack for simulating user-defined literals relies on a workaround to fake hooking the token stream; it would be a whole lot harder without that, while it would be a little easier and a whole lot cleaner if I could just hook the token stream.
Yes, I think I understand your motivation. Can you help me understand what the hook you would write would look like?
That's a fair question. First, let's look at the relevant parts of an AST hook that transforms every float literal into a Decimal constructor call:

class FloatNodeWrapper(ast.NodeTransformer):
    def visit_Num(self, node):
        if isinstance(node.n, float):
            return ast.Call(func=ast.Name(id='Decimal', ctx=ast.Load()),
                            args=[ast.Str(s=repr(node.n))], keywords=[])
        return node

# ...

def source_to_code(self, data, path, *, _optimize=-1):
    source = importlib.util.decode_source(data)
    tree = ast.parse(source)
    tree = FloatNodeWrapper().visit(tree)
    ast.fix_missing_locations(tree)
    return compile(tree, path, 'exec', dont_inherit=True, optimize=_optimize)

Now, here's what I'd like to write for a token hook that does the same thing at the token level:

def retokenize_float(tokens):
    for num, val, *loc in tokens:
        if num == tokenize.NUMBER and ('.' in val or 'e' in val or 'E' in val):
            yield tokenize.NAME, 'Decimal', *loc
            yield tokenize.OP, '(', *loc
            yield tokenize.STRING, repr(val), *loc
            yield tokenize.OP, ')', *loc
        else:
            yield num, val, *loc

# ...

def source_to_code(self, data, path, *, _optimize=-1):
    source = importlib.util.decode_source(data)
    tokens = tokenize.tokenize(source)
    tokens = retokenize_float(tokens)
    return compile(tokens, path, 'exec', dont_inherit=True, optimize=_optimize)

Of course I don't want to do the same thing, I want to do something that you can't do at the AST level—see my user literal hack for an example. But this shows the parallels and differences between the two. If you want more background, see http://stupidpythonideas.blogspot.com/2015/06/hacking-python-without-hacking... (which I wrote to explain to someone else how floatliteralhack works). Of course I'm not presenting this as an ideal design if I were starting Python from scratch, but as the best design given what Python is already producing and consuming (a stream of tokens that's guaranteed to be equivalent to what you get out of the tokenize module).
Also, Python is defined as having a separate lexical analysis phase, and a module named tokenize that does the same thing as this phase, and tests that test it, and so on. So, you can't just throw all that out and remain backward compatible.
I don't see why that is. The lexical "phase" would just become new parsing rules, and so it would be supplanted by the parser. Then you wouldn't need to add special hooks for lexing. You would merely have hooks for parsing.
Callbacks from a tree builder and using a user-modifiable grammar are clearly not backward compatible with ast.NodeTransformer. They're a completely different way of doing things. Is it a better way? Maybe. Plenty of people are using OMeta/JS every day. Of course plenty of people are cursing the fact that OMeta/JS sometimes generates exponential-time backtracking and it's never clear which part of your production rules are to blame, or the fact that you can't get a useful error message out of it, etc. And I'm pretty sure you could design something with most of the strengths of OMeta without its weaknesses (just using a standard packrat PEG parser instead of an extended PEG parser seems like it would turn most of the exponential productions into explicit errors in the grammar…). Or you could go the opposite way and use GLR and bottom-up callbacks instead of PEG and top-down. Something like that would be a great idea for a research project. But it's probably not a great idea for a proposal to change the core of CPython, at least until someone does that research project.
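To make the packrat remark above concrete, here's a toy illustration (mine, not a proposal) of why memoizing a recursive-descent PEG parser tames backtracking: each (rule, position) pair is computed at most once, so total work stays linear in the input for a fixed grammar.

from functools import lru_cache

TEXT = "1+1+1+1"

@lru_cache(maxsize=None)          # the memo table is what makes it "packrat"
def expr(pos):
    # expr <- term '+' expr / term   (ordered choice, PEG style)
    end = term(pos)
    if end is not None and end < len(TEXT) and TEXT[end] == '+':
        rest = expr(end + 1)
        if rest is not None:
            return rest
    return term(pos)

@lru_cache(maxsize=None)
def term(pos):
    return pos + 1 if pos < len(TEXT) and TEXT[pos].isdigit() else None

print(expr(0) == len(TEXT))       # True: the whole input parses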
It's not that it's hard to write a _fast_ parser in Python, but that it's hard to write a parser that does the exact same thing as Python's own parser (and even more so one that does the exact same thing as whatever version of Python you're running under's parser).
I think it's worth exploring having the whole parser in one place, rather than repeating the same structures in at least four places. With every change to the Python grammar, you pay for this forced repetition.
The repetition is really a different issue. A different implementation of the same basic design Python already has could make it so you only have to write explicit code for the 3 places the CST->AST node doesn't follow the same rules as everywhere else and the dozen or so places where the AST has to be post-validated, instead of having to write explicit code for both sides of every single node type. And that kind of cleanup could be done without breaking backward compatibility, because the interfaces on each side of the code would be unchanged. But that's also a lot less fun of a change than writing a whole new parser, so I wouldn't be surprised if nobody ever did it…

On Sat, Jun 6, 2015 at 1:30 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
Yes. I want to point out that if the lexer rules were alongside the parser, they would be generating ast nodes — so the hook for calling Decimal for all floating point tokens would be doable in the same way as your AST hook. For the new tokens that you want, the ideal solution I think is to modify the python parsing grammar before it parses the text.
This is like saying "I want to change some things, but not other things". I want the best long-term solution, whatever that is. (I don't know what it is.) In the long run, moving towards the best solution tends to be the least total work. Specifically, if lexing hooks are implemented differently than parsing hooks, then that change is probably going to be backed out and replaced with the "ideal" — eventually — maybe in five years or ten or twenty. And when it's removed, there's deprecation periods and upgrade pains. At the least, let's explore what the "ideal" solution is.
I don't know about OMeta, but the Earley parsing algorithm is worst-case cubic time: "quadratic time for unambiguous grammars, and linear time for almost all LR(k) grammars".
Yes, totally agree with you. So if it were me doing this work, I would put my energy in the research project to write an amazing parser in Python. And then I would try to convince the Python team to use that. I guess we don't disagree at all.
Cool, I didn't know it was even possible.

No. The way Python currently exposes things, the AST hook runs on an already-generated AST and transforms it into another one, to hand off to the code generator. That means it can only be used to handle things that parse as legal Python syntax (unless you replace the entire parser). What I want is a way to similarly take an already-generated token stream and transform it into another one, to hand off to the parser. That will allow it to be used to handle things that lex as legal Python tokens but don't parse as legal Python syntax, like what Paul suggested. Merging lexing into parsing not only doesn't give me that, it makes that impossible.
For the new tokens that you want, the ideal solution I think is to modify the python parsing grammar before it parses the text.
But I don't want any new tokens. I just want to change the way existing tokens are interpreted. Just as with an AST hook like MacroPy, I don't want any new nodes, I just want to change the way existing nodes are interpreted.
Of course I'm not presenting this as an ideal design if I were starting Python from scratch, but as the best design given what Python is already producing and consuming (a stream of tokens that's guaranteed to be equivalent to what you get out of the tokenize module).
This is like saying "I want to change some things, but not other things".
That's exactly what I'm saying. In particular, I want to change as few things as possible, to get what I want, without breaking stuff that already demonstrably works and has worked for decades.
I don't know about OMeta, but the Earley parsing algorithm is worst-case cubic time: "quadratic time for unambiguous grammars, and linear time for almost all LR(k) grammars".
I don't know why you'd want to use Earley for parsing a programming language. IIRC, it was the first algorithm that could handle rampant ambiguity in polynomial time, but that isn't relevant to parsing programming languages (especially one like Python, which was explicitly designed to be simple to parse), and it isn't relevant to natural languages if you're not still in the 1960s, except in learning the theory and history of parsing. GLR does much better in almost-unambiguous/almost-deterministic languages; CYK can be easily extended with weights (which propagate sensibly, so you can use them for a final judgment, or to heuristically prune alternatives as you go); Valiant is easier to reason about mathematically; etc. And that's just among the parsers in the same basic family as Earley. Also, the point of OMeta is that it's not just a parsing algorithm, it's a complete system that's been designed and built and is being used to write DSLs, macros, and other language extensions in real-life code in languages like JavaScript and C#. So you don't have to imagine what kind of interface you could present or what it might be like to use it in practice, you can use it and find out. And I think it's in the same basic direction as the kind of interface you want for Python's parser.
And I'm pretty sure you could design something with most of the strengths of OMeta without its weaknesses (just using a standard packrat PEG parser instead of an extended PEG parser seems like it would turn most of the exponential productions into explicit errors in the grammar…). Or you could go the opposite way and use GLR and bottom-up callbacks instead of PEG and top-down. Something like that would be a great idea for a research project. But it's probably not a great idea for a proposal to change the core of CPython, at least until someone does that research project.
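To illustrate the linear-time claim with something concrete, here is a toy packrat-style sketch (purely illustrative, not OMeta and not anything in CPython): PEG-like rule functions keyed on input position, with memoization on the position doing the work that bounds standard packrat parsing to linear time.

    from functools import lru_cache

    def make_parser(text):
        # Toy grammar: expr <- number ('+' expr)? ; number <- [0-9]+
        # Each rule maps a position to (value, next_position) or None.
        @lru_cache(maxsize=None)
        def number(i):
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            return (int(text[i:j]), j) if j > i else None

        @lru_cache(maxsize=None)
        def expr(i):
            head = number(i)
            if head is None:
                return None
            value, j = head
            if j < len(text) and text[j] == '+':
                rest = expr(j + 1)
                if rest is not None:
                    return value + rest[0], rest[1]
            return value, j

        return expr

    parse = make_parser('1+20+300')
    print(parse(0))   # (321, 8): the parsed sum and the position after it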
Yes, totally agree with you. So if it were me doing this work, I would put my energy in the research project to write an amazing parser in Python. And then I would try to convince the Python team to use that. I guess we don't disagree at all.
Well, I think we disagree about the value of our time, and about the cost of disruptive changes. If I have a relatively low-work, almost completely non-disruptive way to definitely get everything I actually need, and a high-work, hugely-disruptive way to probably get what I actually need and also probably get a whole bunch of other useful stuff that I might be able to sell everyone else on if I also did a lot of additional work, that seems like a no-brainer to me. In fact, even if I wanted to write an amazing parser library for Python (and I kind of do, but I don't know if I have the time), I still don't think I'd want to suggest it as a replacement for the parser in CPython. Writing all the backward-compat adapters and porting the Python parser over with all its quirks intact and building the tests to prove that its performance and error handling were strictly better and so on wouldn't be nearly as much fun as other things I could do with it.

On Sat, Jun 6, 2015 at 3:17 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
Yes, and what I was suggesting is for the lexer to return AST nodes, so it would be fine to process those nodes in the same way.
Yes, I see *how* you're trying to solve your problem, but my preference is to have one kind of hook rather than two kinds by unifying lexing and parsing. I think that's more elegant.
I suggested Earley to mitigate this fear of "exponential backtracking" since that won't happen in Earley.
If you ever decide to write that amazing parser library for Python and want any help please feel free to let me know. Best, Neil

On Jun 6, 2015, at 09:23, Neil Girdhar <mistersheik@gmail.com> wrote:
Seriously? Tokens don't form a tree, they form a list. Yes, every linked list is just a degenerate tree, so you could have every "node" just include the next one as a child. But why? Do you want to then turn the input text into a tree of character nodes? Python has all kinds of really good tools for dealing with iterables; why take away those tools and force me to work with a more complicated abstraction that Python doesn't have any tools for dealing with? In the case of the user-defined literal hack, for example, I can use the adjacent-pairs recipe from itertools and my transformation becomes trivial. I did it more explicitly in the hack I uploaded, using a generator function with a for statement, just to make it blindingly obvious what's happening. But if I had to deal with a tree, I'd either have to write explicit lookahead or store some state explicitly on the tree or the visitor. That isn't exactly _hard_, but it's certainly _harder_, and for no benefit. Also, if we got my change, I could write code that cleanly hooks parsing in 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people can at least use it, and all of the relevant and complicated code would be shared between the two versions. With your change, I'd have to write code that was completely different for 3.6+ than what I could backport, meaning I'd have to write, debug, and maintain two completely different implementations. And again, for no benefit. And finally, once again: we already have a token stream as part of the process, we already expose every other interesting step in the process, exposing the token stream as it exists today in a way that fits into everything else as it exists today is clearly the easiest and least disruptive thing to do. Sometimes it's worth doing harder or more disruptive things because they provide more benefit, but you haven't yet shown any benefit. You asked me for examples, and I provided them. Why don't you try writing a couple of actual examples--user literals, the LINQ-ish example from MacroPy, whatever--using your proposed design to show us how they could be simpler, or more elegant, or open up further possibilities. Or come up with an example of something your design could do that the existing one (even with my small proposed change) can't.
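For reference, the adjacent-pairs recipe mentioned above is only a few lines of itertools; the sketch below applies it to a token stream, using adjacent STRING tokens as a stand-in for the NUMBER-followed-by-NAME pattern the user-literal hack actually looks for:

    import io
    import itertools
    import tokenize

    def pairwise(iterable):
        # The itertools adjacent-pairs recipe: s -> (s0, s1), (s1, s2), ...
        a, b = itertools.tee(iterable)
        next(b, None)
        return zip(a, b)

    # Look at each token together with the one that follows it.
    source = 'greeting = "Hello, " "world"\n'
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok, nxt in pairwise(tokens):
        if tok.type == nxt.type == tokenize.STRING:
            print('adjacent strings:', tok.string, nxt.string)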
I'm trying to find a way to interpret this that makes sense. I think you're suggesting that we should throw out the idea of letting users write and install simple post-processing hooks in Python, because that will force us to find a way to instead make the entire parsing process user-customizable at runtime, which will force users to come up with "more elegant" solutions involving changing the grammar instead of post-processing it macro-style. If so, I think that's a very bad idea. Decades of practice in both Python and many other languages (especially those with built-in macro facilities) shows that post-processing at the relevant level is generally simple and elegant. Even if we had a fully-runtime-customizable parser, something like OMeta but "closing the loop" and implementing the language in the programmable metalanguage, many things are still simpler and more elegant written post-processing style (as used by existing import hooks, including MacroPy, and in other languages going all the way back to Lisp), and there's a much lower barrier to learning them, and there's much less risk of breaking the compiler/interpreter being used to run your hook in the first place. And, even if none of that were true, and your new and improved system really were simpler in every case, and you had actually built it rather than just envisioning it, there's still backward compatibility to think of. Do you really want to break working, documented functionality that people have written things like MacroPy on top of, even if forcing them to redesign and rewrite everything from scratch would force them to come up with a "more elegant" solution? And finally, the added flexibility of such a system is a cost as well as a benefit--the fact that Arc makes it as easy as possible to "rewrite the language into one that makes writing your application trivial" also means that one Arc programmer can't understand another's code until putting in a lot of effort to learn his idiosyncratic language.
I already explained that using standard PEG with a packrat parser instead of extended PEG with an OMeta-style parser gives you linear time. Why do you think telling me about a decades-older cubic-time algorithm designed for parsing natural languages that's a direct ancestor to two other algorithms I also already mentioned is going to be helpful? Do you not understand the advantages of PEG or GLR over Earley?

On 7 June 2015 at 08:52, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
I don't think I've said this explicitly yet, but I'm +1 on the idea of making it easier to "hack the token stream". As Andrew has noted, there are two reasons this is an interesting level to work at for certain kinds of modifications: 1. The standard Python tokeniser has already taken care of converting the byte stream into Unicode code points, and the code point stream into tokens (including replacing leading whitespace with the structural INDENT/DEDENT tokens) 2. You get to work with a linear stream of tokens, rather than a precomposed tree of AST nodes that you have to traverse and keep consistent If all you're wanting to do is token rewriting, or to push the token stream over a network connection in preference to pushing raw source code or fully compiled bytecode, a bit of refactoring of the existing tokeniser/compiler interface to be less file based and more iterable based could make that easier to work with. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
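A quick way to see point 1 in practice (nothing new here, just the existing tokenize module): by the time a hook would see the stream, leading whitespace has already become structural INDENT/DEDENT tokens.

    import io
    import tokenize

    source = (
        'def f(x):\n'
        '    return x\n'
    )
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # INDENT and DEDENT show up in the stream alongside NAME, OP, etc.
        print(tokenize.tok_name[tok.type], repr(tok.string))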

On Jun 7, 2015, at 03:30, random832@fastmail.us wrote:
I'm pretty sure that just doing nothing special here means you get a SyntaxError from the parser. Although I probably need more test cases. Anyway, this is one of those cases I mentioned where the SyntaxError can't actually show you what's wrong with the code, because the actual source doesn't have an error in it, only the transformed token stream. But there are easier ways to get that--just replace a `None` with a `with` in the token stream and you get an error that shows you a perfectly valid line, with no indication that a hook has screwed things up for you. I think we can at least detect that the tokens don't match the source line and throw in a note to go look for an installed token-transforming hook. It would be even nicer if we could show what the untokenized line looks like, so the user can see why it's an error. Something like this:
      File "<input>", line 1
        if spam is None:
                   ^
    SyntaxError: invalid syntax
    Tokens do not match input, parsed as
        if spam is with :
Of course in the specific case you mentioned of unbalanced parens swallowing a dedent, the output still wouldn't be useful, but I'm not sure what we could show usefully in that case anyway.
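A rough sketch of the detection half of that idea (illustrative only; the helper name is made up, and comparing re-tokenized output is just one possible way to do it):

    import io
    import tokenize

    def tokens_match_source(tokens, source):
        # Untokenize the (possibly transformed) tokens and re-tokenize the
        # result; if the (type, string) sequence differs from tokenizing the
        # original source, some hook has changed the stream.
        regenerated = tokenize.untokenize(tokens)

        def pairs(text):
            return [(t.type, t.string)
                    for t in tokenize.generate_tokens(io.StringIO(text).readline)]

        return pairs(regenerated) == pairs(source)

    source = 'if spam is None:\n    pass\n'
    toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
    print(tokens_match_source(toks, source))   # True for an untouched stream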

On Sun, Jun 7, 2015, at 09:24, Andrew Barnert wrote:
I'm pretty sure that just doing nothing special here means you get a SyntaxError from the parser. Although I probably need more test cases.
I'm actually talking about what happens if the _untransformed_ stream contains an unbalanced bracket that the hook is supposed to eliminate (and/or supply the balancing one). My mental model of this idea was that the "lexer" generates the entire untransformed (but including indent/dedent magic etc) token sequence, then supplies it to the hook.

random832@fastmail.us wrote:
I'm actually talking about what happens if the _untransformed_ stream contains an unbalanced bracket that the hook is supposed to eliminate
I'm of the opinion that designing an input language to require or allow unmatched brackets is a bad idea. If nothing else, it causes grief for editors that do bracket matching. -- Greg

On Jun 6, 2015, at 22:59, Nick Coghlan <ncoghlan@gmail.com> wrote:
Actually, as I discovered while trying to hack in the change this afternoon, the C tokenizer doesn't actually take care of converting the byte stream to Unicode. It does take care of detecting the encoding, but what it hands to the parsetok function is still encoded bytes. The Python wrapper does transparently decode for you (in 3.x), but that actually just makes it harder to feed the output back into the parser, because the parser wants encoded bytes. (Also, as I mentioned before, it would be nice if the Python wrapper could just take Unicode in the first place, because the most obvious place to use this is in an import hook, where you can detect and decode the bytes yourself in a single line, and it's easier to just use the string than to encode it to UTF-8 so the tokenizer can detect UTF-8 so either the Python tokenizer wrapper or the C parser can decode it again...). Anyway, this part was at least easy to temporarily work around; the stumbling block that prevented me from finishing a working implementation this afternoon is a bit hairier. The C tokenizer hands the parser the current line (which can actually be multiple lines) and start and end pointers to characters within that line. It also hands it the current token string, but the parser ignores that and just reads from line+start to line+end. The Python tokenizer, on the other hand, gives you line number and (Unicode-based) column numbers for start and end. Converting those to encoded-bytes offsets isn't _that_ hard... but those are offsets into the original (encoded) line, so the parser is going to see the value of the original token rather than the token value(s) you're trying to substitute, which defeats the entire purpose. I was able to implement a hacky workaround using untokenize to fake the current line and provide offsets within that, but that means you get garbage from SyntaxErrors, and all your column numbers--and, worse, all your line numbers, if you add in a multi-line token--are off within the AST and bytecode. (And there may be other problems; those are just the ones I saw immediately when I tried it...) I think what I'm going to try next is to fork the whole parsetok function and write a version that uses the token's string instead of the substring of the line, and start and stop as offsets instead of pointers. I'm still not sure whether the token string and line should be in tok->encoding, UTF-8, UTF-32, or a PyUnicode object, but I'll figure that out as I do it.... Once I get that working for the wrapped-up token iterator, then I can see if I can reunify it with the existing version for the C tokenizer (without any performance penalty, and without breaking pgen). I'd hate to have two copies of that giant function to keep in sync. Meanwhile, I'm not sure what to do about tokens that don't have the optional start/stop/line values. Maybe just not allow them (just because untokenize can handle it doesn't mean ast.parse has to), or maybe just untokenize a fake line (and if any SyntaxErrors are ugly and undebuggable, well, don't skip those values...). The latter might be useful if for some reason you wanted to generate tokens on the fly instead of just munging a stream of tokens from source text you have available. I'm also not sure what to do about a few error cases.
For example, if you feed the parser something that isn't iterable, or whose values aren't an iterable of iterables of length 2 to 5 with the right types, that really feels more like a TypeError than a SyntaxError (and that would also be a good way to signal the end user that the bug is in the token stream transformer rather than in the source code...), but raising a TypeError from within the parser requires a bit more refactoring (the tokenizer can't tell you what error to raise, just that the current token is an error along with a tokenizer error code--although I could add an E_NOTOK error code that the parser interprets as "raise a TypeError instead of a SyntaxError"), and I'm not sure whether that would affect any other stuff. Anyway, for the first pass I may just leave it as a SyntaxError, just to get something working. Finally, it might be nice if it were possible to generate a SyntaxError that showed the original source line but also told you that the tokens don't match the source (again, to signal the end user that he should look at what the hook did to his code, not just his code), but I'm not sure how necessary that is, or how easy it will be (it depends on how I end up refactoring parsetok).
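For anyone following along from the pure-Python side, this is the information the tokenize module hands you per token, the string plus (row, column) start and end positions and the physical line, which is what would have to be turned into parsetok's byte offsets (simple illustration, nothing new):

    import io
    import tokenize

    source = 'total = price * qty\n'
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Each TokenInfo carries (row, col) start/end positions and the
        # physical line the token came from (tok.line, not printed here).
        print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)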
I didn't think about that use case at all, but that could be very handy.

On Sun, Jun 7, 2015 at 1:59 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I will explain in another message how to replace the indent and dedent tokens so that the lexer loses most of its "magic" and becomes just like the parser.
The AST nodes would contain within them the linear stream of tokens that you are free to work with. The AST also encodes the structure of the tokens, which can be very useful if only to debug how the tokens are being parsed. You might find yourself, when doing a more complicated lexical transformation, trying to reverse engineer where the parse tree nodes begin and end in the token stream. This would be a nightmare. This is the main problem with trying to process the token stream "blind" to the parse tree.
You can still do all that with the tokens included in the parse tree.
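As a point of reference for lining nodes up with tokens, the current ast module already records where each statement and expression node starts in the source; a minimal illustration (not the merged tokens-in-the-tree design being discussed):

    import ast

    tree = ast.parse('result = spam + eggs\n')
    for node in ast.walk(tree):
        if hasattr(node, 'lineno'):
            # Statement and expression nodes carry lineno/col_offset.
            print(type(node).__name__, node.lineno, node.col_offset)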

On 8 June 2015 at 12:23, Neil Girdhar <mistersheik@gmail.com> wrote:
I don't dispute that this *can* be done, but what would it let me do that I can't already do today? In addition, how will I be able to continue to do all the things that I can do today with the separate tokenisation step? *Adding* steps to the compilation toolchain is doable (one of the first things I was involved in as part of CPython core development was the introduction of the AST-based parser in Python 2.5), but taking them *away* is much harder. You appear to have an idealised version of what a code generation toolchain "should" be, and would like to hammer CPython's code generation pipeline specifically into that mould. That's not the way this works - we don't change the code generator for the sake of it, we change it to solve specific problems with it. Introducing the AST layer solved a problem. Introducing an AST optimisation pass would solve a problem. Making the token stream easier to manipulate would solve a problem. Merging the lexer and the parser doesn't solve any problem that we have.
Anything that cares about the structure to that degree shouldn't be manipulating the token stream - it should be working on the parse tree.
Not as easily, because I have to navigate the parse tree even when I don't care about that structure, rather than being able to just look at the tokens in isolation. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Jun 7, 2015 at 10:42 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
You're right. And as usual, Nick, your analysis is spot on. My main concern is that the idealized way of parsing the language is not precluded by any change. Does adding token manipulation promise forwards compatibility? Will a Python 3.9 have to have the same kind of token manipulator exposed? If not, then I'm +1 on token manipulation. :)
I don't think it would be more of a burden than it would prevent bugs by allowing you to ensure that the parse tree structure is what you think it is. It's a matter of intuition I guess.

On 8 June 2015 at 12:47, Neil Girdhar <mistersheik@gmail.com> wrote:
That may have been the heart of the confusion, as token manipulation is *already* a public feature: https://docs.python.org/3/library/tokenize.html The tokenizer module has been a public part of Python for longer than I've been a Pythonista (first documented in 1.5.2 in 1999): https://docs.python.org/release/1.5.2/lib/module-tokenize.html As a result, token stream manipulation is already possible, you just have to combine the tokens back into a byte stream before feeding them to the compiler. Any future Python interpreter would be free to fall back on implementing a token based API that way, if the CPython code generator itself were to gain a native token stream interface. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
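Concretely, the fallback described here is just the documented round trip: tokenize the bytes, manipulate the stream, untokenize back to bytes (the leading ENCODING token makes untokenize return encoded bytes), and compile that. A minimal pass-through sketch, with no actual manipulation:

    import io
    import tokenize

    source_bytes = b'x = 1\n'
    toks = tokenize.tokenize(io.BytesIO(source_bytes).readline)
    # A real hook would transform toks here; this just passes them through.
    round_tripped = tokenize.untokenize(toks)   # bytes, thanks to ENCODING
    code = compile(round_tripped, '<tokens>', 'exec')
    exec(code)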

On Sat, Jun 6, 2015 at 6:52 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
When the work done by the lexer is done by the parser, the characters contained in a lexical node will be siblings. Typically, in Python a tree is represented as nodes with iterable children, so the characters would just be a string.
The stream would still be just an iterable.
You would still be able to do all this stuff.
If I find time, I'll do that. I will explain my solution in another message.
I understand that you are motivated by a specific problem. However, your solution does not solve the general problem. If you only allow transformations of the token stream, the token set is fixed. Transformations of the token stream also hide — even in your example — the fact that you're actually building what is conceptually a subtree. It makes more sense to me to solve the problem in general, once and for all. (1) Make it easy to change the grammar, and (2) make lexing part of the grammar. Now, you don't have to change the grammar to solve some problems. Sometimes, you can just use AST transformers to accomplish what you were doing with a lexical transformer. That's nice because it's one less thing to learn. Sometimes, you need the power of changing the grammar. That is already coming with the popularity of languages like Theano. I really want to transform Python code into Theano, and for that it may be more elegant to change the grammar.

On 6 June 2015 at 15:30, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
The repetition is really a different issue. A different implementation of the same basic design Python already has could make it so you only have to write explicit code for the 3 places where the CST->AST conversion doesn't follow the same rules as everywhere else and the dozen or so places where the AST has to be post-validated, instead of having to write explicit code for both sides of every single node type. And that kind of cleanup could be done without breaking backward compatibility, because the interfaces on each side of the code would be unchanged. But that's also a lot less fun of a change than writing a whole new parser, so I wouldn't be surprised if nobody ever did it…
Eugene Toder had a decent go at introducing more autogeneration into the code generation code a few years ago as part of building out an AST level optimiser: http://bugs.python.org/issue11549 The basic concepts Eugene introduced still seem sound to me, there'd just be some work in bringing the patches up to date to target 3.6. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (12)
- Andrew Barnert
- David Mertz
- Greg Ewing
- Guido van Rossum
- Luciano Ramalho
- Neil Girdhar
- Nick Coghlan
- random832@fastmail.us
- Robert Collins
- Ryan Gonzalez
- s.krah
- Stephen J. Turnbull