The documentation is beautiful.

One of the features I was looking for when evaluating parsers was the ability to run expressions on the rules.  For example, if you match
"\begin{\w*}"
and \w* turns out to be "enumeration", then later when you match
"\end{\w*}"
you want to check that this \w* is also "enumeration", or else raise an error.
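
As a rough sketch of that kind of rule-level check (using plain re plus a hand-written comparison rather than any particular parsing library; the function name and error message are just for illustration):

    import re

    def parse_environment(text):
        # Match \begin{name} ... \end{name} and check that the two names agree.
        m = re.match(r'\\begin\{(\w*)\}(.*?)\\end\{(\w*)\}', text, re.DOTALL)
        if m is None:
            raise SyntaxError('not an environment')
        begin_name, body, end_name = m.groups()
        # The check attached to the \end rule reads the value stored by \begin.
        if end_name != begin_name:
            raise SyntaxError(
                '\\end{%s} does not match \\begin{%s}' % (end_name, begin_name))
        return ('environment', begin_name, body)

    parse_environment(r'\begin{enumeration}item one\end{enumeration}')
    # -> ('environment', 'enumeration', 'item one')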

A similar thing happens when you're trying to parse Python source code, with its significant indentation.  You might want to check that the next indentation level is the same as the current one or corresponds to a valid dedent.
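
As a rough sketch (assuming a stack of indentation levels, roughly the way CPython's tokenizer keeps one), that check might look like this:

    def check_indents(lines):
        # Each new indentation level must either go deeper than the current
        # one (indent) or pop back to a previously seen level (dedent).
        stack = [0]
        for lineno, line in enumerate(lines, 1):
            if not line.strip():
                continue  # blank lines don't affect indentation
            indent = len(line) - len(line.lstrip(' '))
            if indent > stack[-1]:
                stack.append(indent)
            else:
                while indent < stack[-1]:
                    stack.pop()
                if indent != stack[-1]:
                    raise IndentationError('bad dedent on line %d' % lineno)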

Expressions on rules should be able to control whether something matches, store values in the parse tree, store values that can be read by other expressions, and raise parsing errors.

The beauty of expressions is that you can do the parsing and build the AST in one shot.  If you've ever looked at the CPython source code, it's unfortunate that those tasks have to be done separately, even though most changes to the AST also require changes to the parser.
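
As a toy sketch (hypothetical rule functions, not how CPython actually does it), parsing and AST construction can be fused by having each rule return an AST node directly:

    import re

    def parse_number(text, pos):
        m = re.compile(r'\d+').match(text, pos)
        if m is None:
            raise SyntaxError('expected a number at position %d' % pos)
        return ('num', int(m.group())), m.end()

    def parse_sum(text, pos=0):
        node, pos = parse_number(text, pos)
        while pos < len(text) and text[pos] == '+':
            right, pos = parse_number(text, pos + 1)
            node = ('add', node, right)   # AST node built as we match
        return node, pos

    parse_sum('1+2+3')
    # -> (('add', ('add', ('num', 1), ('num', 2)), ('num', 3)), 5)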

Most modern parsing algorithms support this.  The older parsing libraries (lex/yacc, flex/bison, ANTLR) were much more limited.

Also, I'm not sure the separation between tokenization and parsing is necessary if you're not worried about efficiency.

Best,

Neil

On Monday, July 15, 2019 at 9:45:59 PM UTC-4, Nam Nguyen wrote:
Hello list,

I sent an email to this list two or three months ago about the same idea. In that discussion, there was both skepticism and support. Since I had some time during the previous long weekend, I have made my idea more concrete and I thought I would try with the list again, after having run it through some of you privately.

GOAL: To have some parsing primitives in the stdlib so that other modules in the stdlib itself can make use of them. This would alleviate various security issues we have seen over the years.

With that goal in mind, I opine that any parsing library for this purpose should have the following characteristics:

#. Can be expressed in code. My opinion is that it is hard to review generated code. Code review is even more crucial in security contexts.

#. Small and verifiable. This helps build trust in the code that is meant to plug security holes.

#. Less evolving. Being in the stdlib has the drawback of slower development velocity. The library should be theoretically sound and stable from the beginning.

#. Universal. Most of the time we'll parse left-factored context-free grammars, but sometimes we'll also want to parse context-sensitive grammars such as short XML fragments in which end tags must match start tags.

I have implemented a tiny (~200 SLOCs) package at https://gitlab.com/nam-nguyen/parser_compynator that demonstrates that something like this is possible. There are several examples for you to get a feel for it, as well as some early benchmark numbers to consider. This is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope it will convert the skeptics ;).

Finally, my request to the list is: Please debate on: 1) whether we want a small (even private, underscore prefixed) parsing library in the stdlib to help with tasks that are a little too complex for regexes, and 2) if yes, what should it look like?

I also welcome comments (naming, uses of operator overloading, features, bikeshedding, etc.) on the above package ;).

Thanks!
Nam