I sent an email to this list two or three months ago about the same idea. In that discussion, there was both skepticism and support. Since I had some time over the recent long weekend, I have made the idea more concrete and, after running it past some of you privately, I thought I would try the list again.
GOAL: To have some parsing primitives in the stdlib that other stdlib modules can use. This would alleviate various security issues we have seen over the years.
With that goal in mind, I opine that any parsing library for this purpose should have the following characteristics:
#. Can be expressed in code. My opinion is that it is hard to review generated code. Code review is even more crucial in security contexts.
#. Small and verifiable. This helps build trust in the code that is meant to plug security holes.
#. Slow to change. Being in the stdlib has a drawback: slow development velocity. The library should therefore be theoretically sound and stable from the beginning.
#. Universal. Most of the time we'll parse left-factored context-free grammars, but sometimes we'll also want to parse context-sensitive grammars, such as short XML fragments in which end tags must match start tags.
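To make the last point concrete, here is a minimal sketch of how a combinator-style parser can handle the matching-tags case. It is only an illustration under my own assumptions: the names (`regex`, `literal`, `seq`, `fmap`, `bind`, `element`) are hypothetical and are not the API of the package mentioned in this message, and the element body is limited to plain text to keep the example short.

```python
import re
from typing import Any, Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position),
# or None on failure. All names below are hypothetical.
Result = Optional[Tuple[Any, int]]
Parser = Callable[[str, int], Result]


def regex(pattern: str) -> Parser:
    """Match a regular expression starting exactly at the current position."""
    compiled = re.compile(pattern)

    def parse(text: str, pos: int) -> Result:
        m = compiled.match(text, pos)
        return (m.group(0), m.end()) if m else None

    return parse


def literal(expected: str) -> Parser:
    """Match a fixed string at the current position."""
    def parse(text: str, pos: int) -> Result:
        if text.startswith(expected, pos):
            return expected, pos + len(expected)
        return None

    return parse


def seq(*parsers: Parser) -> Parser:
    """Run parsers one after another and collect their values in a list."""
    def parse(text: str, pos: int) -> Result:
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos

    return parse


def fmap(parser: Parser, func: Callable[[Any], Any]) -> Parser:
    """Transform the value produced by a successful parse."""
    def parse(text: str, pos: int) -> Result:
        result = parser(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return func(value), new_pos

    return parse


def bind(parser: Parser, make_next: Callable[[Any], Parser]) -> Parser:
    """Run `parser`, then build the next parser from its value.

    Because the second parser depends on what the first one matched,
    this is the step that allows context-sensitive parsing.
    """
    def parse(text: str, pos: int) -> Result:
        result = parser(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return make_next(value)(text, new_pos)

    return parse


def element(text: str, pos: int = 0) -> Result:
    """Parse <name>text</name>, requiring the end tag to match the start tag."""
    open_tag = seq(literal("<"), regex(r"[A-Za-z_][\w.-]*"), literal(">"))

    def after_open(parts: list) -> Parser:
        name = parts[1]
        body_and_close = seq(regex(r"[^<]*"), literal(f"</{name}>"))
        return fmap(body_and_close, lambda bc: (name, bc[0]))

    return bind(open_tag, after_open)(text, pos)
```

With this sketch, `element("<b>hi</b>")` succeeds while `element("<b>hi</i>")` returns `None`, because the closing-tag parser is constructed from the name that the opening tag actually matched.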
I have implemented a tiny (~200 SLOC) package at https://gitlab.com/nam-nguyen/parser_compynator
that demonstrates that something like this is possible. There are several examples to give you a feel for it, as well as some early benchmark numbers to consider. It is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope that it will convert the skeptics ;).
Finally, my request to the list is: please debate 1) whether we want a small (even private, underscore-prefixed) parsing library in the stdlib to help with tasks that are a little too complex for regexes, and 2) if so, what it should look like.
I also welcome comments (naming, use of operator overloading, features, bikeshedding, etc.) on the above package ;).