Universal parsing library in the stdlib to alleviate security issues
Hello list,

I sent an email to this list two or three months ago about the same idea. In that discussion, there were both skepticism and support. Since I had some time during the previous long weekend, I have made my idea more concrete and I thought I would try with the list again, after having run it through some of you privately.

GOAL: To have some parsing primitives in the stdlib so that other modules in the stdlib itself can make use of them. This would alleviate various security issues we have seen throughout the years.

With that goal in mind, I opine that any parsing library for this purpose should have the following characteristics:

#. Can be expressed in code. My opinion is that it is hard to review generated code. Code review is even more crucial in security contexts.

#. Small and verifiable. This helps build trust in the code that is meant to plug security holes.

#. Less evolving. Being in the stdlib has its drawback, namely development velocity. The library should be theoretically sound and stable from the beginning.

#. Universal. Most of the time we'll parse left-factored context-free grammars, but sometimes we'll also want to parse context-sensitive grammars, such as short XML fragments in which end tags must match start tags.

I have implemented a tiny (~200 SLOCs) package at https://gitlab.com/nam-nguyen/parser_compynator that demonstrates something like this is possible. There are several examples for you to get a feel for it, as well as some early benchmark numbers to consider. This is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope that it would convert the skeptics ;).

Finally, my request to the list is: please debate 1) whether we want a small (even private, underscore-prefixed) parsing library in the stdlib to help with tasks that are a little too complex for regexes, and 2) if yes, what it should look like.

I also welcome comments (naming, uses of operator overloading, features, bikeshedding, etc.) on the above package ;).

Thanks!
Nam
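P.S. For anyone who has not seen parser combinators before, here is a minimal, hypothetical sketch of the style I mean. It is illustrative only and is not the API of the package above: a parser is simply a function from the remaining input to a set of (value, rest) pairs, so ambiguity (several results) and failure (an empty set) fall out naturally.

# Minimal parser-combinator sketch (illustrative only; not the API of the
# parser_compynator package). A parser takes the remaining input and returns
# a set of (value, rest) pairs; an empty set means the parse failed at that
# point, and multiple pairs express ambiguity.

def char(c):
    """Match a single literal character."""
    def parse(s):
        return {(c, s[1:])} if s[:1] == c else set()
    return parse

def seq(p, q):
    """Run p, then run q on whatever p left over; pair up the values."""
    def parse(s):
        return {((v1, v2), rest2)
                for v1, rest1 in p(s)
                for v2, rest2 in q(rest1)}
    return parse

def alt(p, q):
    """Try both alternatives; ambiguity simply yields more results."""
    def parse(s):
        return p(s) | q(s)
    return parse

# Grammar: 'ab' or 'ac'
parser = alt(seq(char('a'), char('b')), seq(char('a'), char('c')))
print(parser('ab'))   # {(('a', 'b'), '')}
print(parser('ax'))   # set() -- failure is an empty result set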
On Jul 15, 2019, at 18:44, Nam Nguyen <bitsink@gmail.com> wrote:
I have implemented a tiny (~200 SLOCs) package at https://gitlab.com/nam-nguyen/parser_compynator that demonstrates something like this is possible. There are several examples for you to have a feel of it, as well as some early benchmark numbers to consider. This is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope that it would convert the skeptics ;).
For at least some of your use cases, I don’t think it’s a problem that it’s 70x slower than the custom parsers you’d be replacing. How often do you need to parse a million URLs in your inner loop? Also, if the function composition is really the performance hurdle, can you optimize that away relatively simply, just by building an explicit tree (expression-template style) and walking the tree in a __call__ method, rather than building an implicit tree of nested calls? (And that could be optimized further if needed, e.g. by turning the tree walk into a simple virtual machine where all of the fundamental operations are inlined into the loop, and maybe even accelerating that with C code.)

But I do think it’s a problem that there seems to be no way to usefully indicate failure to the caller, and I’m not sure that could be fixed as easily. Invalid inputs in your readme examples don’t fail, they successfully return an empty set. There also doesn’t seem to be any way to trigger a hard fail rather than a backtrack. So I’m not sure how a real urlparse replacement could do the things the current one does, like raising a ValueError on https://abc.d[ef.ghi/ complaining that the netloc looks like an invalid IPv6 address. (Maybe you could def a function that raises a ValueError and attach it as a where somewhere in the parser tree? But even if that works, wouldn’t you get a meaningless exception that doesn’t have any information about where in the source text or where in the parse tree it came from or why it was raised, and, as your readme says, a stack trace full of garbage?)

Can you add failure handling without breaking the “~200LOC and easy to read” feature of the library, and without breaking the “easy to read once you grok parser combinators” feature of the parsers built with it?
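For concreteness, here is a rough sketch of what I mean by the "explicit tree" idea, assuming the cost really is in the nested closures; the class names and operator choices are illustrative, not taken from any existing library. The combinators build a small tree of parser nodes up front, and a single __call__ per node walks it; a later step could flatten that tree into a loop-driven virtual machine.

# Sketch of expression-template-style combinators: each operator builds a
# node in an explicit parse tree instead of capturing closures, and parsing
# is a walk over that tree. Illustrative names only.

class Parser:
    def __add__(self, other):      # sequencing
        return Seq(self, other)
    def __or__(self, other):       # alternation
        return Alt(self, other)

class Char(Parser):
    def __init__(self, c):
        self.c = c
    def __call__(self, s):
        return {(self.c, s[1:])} if s[:1] == self.c else set()

class Seq(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __call__(self, s):
        return {((v1, v2), rest2)
                for v1, rest1 in self.left(s)
                for v2, rest2 in self.right(rest1)}

class Alt(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __call__(self, s):
        return self.left(s) | self.right(s)

# The tree is built once and can be inspected, optimized, or compiled to a
# flat instruction list before any input is parsed.
parser = (Char('a') + Char('b')) | (Char('a') + Char('c'))
print(parser('ab'))   # {(('a', 'b'), '')}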
On 16 Jul 2019, at 04:47, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
How often do you need to parse a million URLs in your inner loop?
As it happens, I work on code that would be impacted by such a slowdown. It runs in production 24x7 and parses URLs on its critical path. Barry
On Tue, Jul 16, 2019 at 1:18 PM Barry <barry@barrys-emacs.org> wrote:
On 16 Jul 2019, at 04:47, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
How often do you need to parse a million URLs in your inner loop?
As it happens i work on code that would be impacted by such a slow down. It runs in production 24x7 and parses URLs on its critical path.
It does not have to be either/or ;). In general, I would prefer security/correctness over performance. But if your use cases call for performance, it is perfectly fine to understand the tradeoffs and opt in to the more appropriate solutions. And, of course, maybe there is a solution that could satisfy *both*. Generally speaking, though, do you see 1 millisecond spent on parsing a URL as a deal breaker? I sense that some web frameworks might not like that very much, but I don't have any concrete use case to quote.
Barry
On 17 Jul 2019, at 05:17, Nam Nguyen <bitsink@gmail.com> wrote:
On Tue, Jul 16, 2019 at 1:18 PM Barry <barry@barrys-emacs.org> wrote:
On 16 Jul 2019, at 04:47, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
How often do you need to parse a million URLs in your inner loop?
As it happens i work on code that would be impacted by such a slow down. It runs in production 24x7 and parses URLs on its critical path.
It does not have to be either or ;). In general, I would prefer security / correctness over performance.
Of course.
But if your use cases call for performance, it is perfectly fine to understand the tradeoffs, and opt in to the more appropriate solutions. And, of course, maybe there is a solution that could satisfy *both*.
Generally speaking, though, do you see 1 millisecond spent on parsing a URL deal breaker? I sense that some web frameworks might not like that very much, but I don't have any concrete use case to quote.
Yes, 1ms would be a serious issue. I guess what I'm concerned about is the use of a universal parser, for a benefit I'm not clear exists, having a terrible effect on the speed of code that is secure and correct. Barry
Barry
On Wed, Jul 17, 2019 at 12:38 AM Barry Scott <barry@barrys-emacs.org> wrote:
But if your use cases call for performance, it is perfectly fine to understand the tradeoffs, and opt in to the more appropriate solutions. And, of course, maybe there is a solution that could satisfy *both*.
Generally speaking, though, do you see 1 millisecond spent on parsing a URL deal breaker? I sense that some web frameworks might not like that very much, but I don't have any concrete use case to quote.
Yes 1ms would be a serious issue.
I guess what I'm concerned about is the use of a universal parser, for a benefit I'm not clear exists, having a terrible effect on the speed of code that is secure and correct.
I hope your doubt is about *my* library, which has not proven itself yet, rather than about the benefit of a proper parsing package. If it is about the latter, perhaps these links could convince you that URL parsing was simply not "secure and correct":

https://www.cvedetails.com/cve/CVE-2019-10160/ urlsplit and urlparse regress from the fix below.
https://www.cvedetails.com/cve/CVE-2019-9636/ urlsplit and urlparse vs unicode normalization.
https://bugs.python.org/issue30500 urlparse's handling of # in password

And many more related to HTTP headers, cookies, email... These things are tricky. It is fair to use the numbers I currently have (~900 usec vs 13 usec per parse) to ballpark the impact, but I'm pretty confident that performance will sort itself out acceptably, eventually (and even with an entirely different library). If 1ms is a deal breaker, what is a more palatable latency? How much latency would you trade for security? Thanks. Nam
On Thu, Jul 18, 2019 at 2:25 PM Nam Nguyen <bitsink@gmail.com> wrote:
If 1ms is a deal breaker, what is a more palatable latency? How much latency would you trade for security?
Have you proven that this universal parser will solve the security problems? My apologies if you have and I missed it. ChrisA
On 18 Jul 2019, at 05:23, Nam Nguyen <bitsink@gmail.com> wrote:
On Wed, Jul 17, 2019 at 12:38 AM Barry Scott <barry@barrys-emacs.org> wrote:
But if your use cases call for performance, it is perfectly fine to understand the tradeoffs, and opt in to the more appropriate solutions. And, of course, maybe there is a solution that could satisfy *both*.
Generally speaking, though, do you see 1 millisecond spent on parsing a URL deal breaker? I sense that some web frameworks might not like that very much, but I don't have any concrete use case to quote.
Yes 1ms would be a serious issue.
I guess what I'm concerned about is the use of a universal parser, for a benefit I'm not clear exists, having a terrible effect on the speed of code that is secure and correct.
I hope your doubt is about *my* library, which has not proven itself yet, rather than about the benefit of a proper parsing package. If it is about the latter, perhaps these links could convince you that URL parsing was simply not "secure and correct":
https://www.cvedetails.com/cve/CVE-2019-10160/ urlsplit and urlparse regress from the fix below. https://www.cvedetails.com/cve/CVE-2019-9636/ urlsplit and urlparse vs unicode normalization. https://bugs.python.org/issue30500 urlparse's handling of # in password
That is correlation; as Chris said, show causation. I'm working on the impact of CVE-2019-9636 as part of my day job. I'd be interested in your analysis of how your parsing proposal would have avoided this problem before it was described. I add the "before it was described" because I think you are claiming that the universal parser will prevent this class of security issue by the nature of its design.
And many more related to HTTP header, cookie, email... These things are tricky.
Parsing of HTTP headers is not that hard. Have a look at the code in the Twisted library that handles all this pretty well. Where HTTP gets complex is the semantic meaning of the headers. Where performance matters is when handling very large HTTP messages. We see 1 GiB POSTs to backup servers that contain a million small files MIME-encoded in them, and Python can handle that well enough. But not with a 70x slowdown.
It is fair to use the numbers I currently have (~900 usec vs 13 usec per parse) to ballpark the impact but I'm pretty confident that performance will sort itself out acceptably, eventually (and even with an entirely different library).
If 1ms is a deal breaker, what is a more palatable latency?
You need to be very close to the 13us mark. For correctness and security I will take the hit of the patch to CVE-2019-9636, which will be a few microseconds at most.
How much latency would you trade for security?
Please show your approach can deliver that extra security. Barry
Thanks. Nam
On Thu, Jul 18, 2019 at 12:07 AM Barry Scott <barry@barrys-emacs.org> wrote:
On 18 Jul 2019, at 05:23, Nam Nguyen <bitsink@gmail.com> wrote:
On Wed, Jul 17, 2019 at 12:38 AM Barry Scott <barry@barrys-emacs.org> wrote:
But if your use cases call for performance, it is perfectly fine to understand the tradeoffs, and opt in to the more appropriate solutions. And, of course, maybe there is a solution that could satisfy *both*.
Generally speaking, though, do you see 1 millisecond spent on parsing a URL deal breaker? I sense that some web frameworks might not like that very much, but I don't have any concrete use case to quote.
Yes 1ms would be a serious issue.
I guess what I'm concerned about is the use of a universal parser, for a benefit I'm not clear exists, having a terrible effect on the speed of code that is secure and correct.
I hope your doubt is about *my* library, which has not proven itself yet, rather than about the benefit of a proper parsing package. If it is about the latter, perhaps these links could convince you that URL parsing was simply not "secure and correct":
https://www.cvedetails.com/cve/CVE-2019-10160/ urlsplit and urlparse regress from the fix below. https://www.cvedetails.com/cve/CVE-2019-9636/ urlsplit and urlparse vs unicode normalization. https://bugs.python.org/issue30500 urlparse's handling of # in password
That is correlation; as Chris said, show causation.
There has been no proof so far. And this is actually a good discussion because it now focuses on how a parsing library might help, not particularly on my lousy implementation ;).
I'm working on the impact of CVE-2019-9636 as part of my day-job.
I'd be interested in your analysis of how your parsing proposal would have avoided this problem before it was described. I add the "before it was described" because I think you are claiming that the universal parser will prevent this class of security issue by the nature of its design.
That's exactly what I'd like the list to consider. I claim that a parsing library would prevent this class of issues simply because it follows the specification to the letter. As you'll find in my code, the grammar is a trivial translation of what is in the spec, one for one. Not only will this help explain what the code is doing, but it will also raise confidence that the code matches the spec. Full disclosure: I fixed bpo30500, but I was not at all confident that the fix was correct. I understood what the code did; I just couldn't tell if that was the right thing to do.
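As a hedged illustration of the "one for one" claim (the helpers one_of, many and seq below are hypothetical stand-ins, not the package's actual API), the RFC 3986 rule scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) could translate almost line for line:

# Illustrative translation of an RFC 3986 rule into combinators.
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# The helpers are hypothetical; the point is that the last definition
# mirrors the ABNF in the spec.
import string

def one_of(chars):
    """Match one character drawn from the given set."""
    def parse(s):
        return {(s[0], s[1:])} if s and s[0] in chars else set()
    return parse

def many(p):
    """Zero or more repetitions, greedily; assumes p yields at most one result."""
    def parse(s):
        value, rest = '', s
        while True:
            result = p(rest)
            if not result:
                return {(value, rest)}
            (v, rest), = result
            value += v
    return parse

def seq(p, q):
    """p followed by q, concatenating the matched text."""
    def parse(s):
        return {(v1 + v2, rest2)
                for v1, rest1 in p(s)
                for v2, rest2 in q(rest1)}
    return parse

alpha = one_of(string.ascii_letters)
scheme_rest = one_of(string.ascii_letters + string.digits + '+-.')

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
scheme = seq(alpha, many(scheme_rest))

print(scheme('https://example.com'))   # {('https', '://example.com')}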
And many more related to HTTP header, cookie, email... These things are tricky.
Parsing of HTTP headers is not that hard. Have a look at the code in the Twisted library that handles all this pretty well. Where HTTP gets complex is the semantic meaning of the headers. Where performance matters is when handling very large HTTP messages. We see 1 GiB POSTs to backup servers that contain a million small files MIME-encoded in them, and Python can handle that well enough. But not with a 70x slowdown.
When performance matters, there are alternatives. For example, XML parsing can be done with elementtree. And someone smarter than me can implement a much faster parsing library to accommodate both security and performance.
It is fair to use the numbers I currently have (~900 usec vs 13 usec per parse) to ballpark the impact but I'm pretty confident that performance will sort itself out acceptably, eventually (and even with an entirely different library).
If 1ms is a deal breaker, what is a more palatable latency?
You need to be very close to the 13us mark. For correctness and security I will take the hit of the patch to CVE-2019-9636, which will be a few microseconds at most.
I have to admit there's a long way to go before my package could get there, if ever. But I won't be surprised if some alternative parsing libraries can deliver that.
How much latency would you trade for security?
Please show your approach can deliver that extra security.
I am using the same inputs as in those references, and not changing anything in the parser:

bpo30500: 'http://127.0.0.1#@evil.com/' is parsed as scheme http, host 127.0.0.1, fragment @evil.com, as expected. Path is an empty string, query is None.

CVE-2019-9636 (aka bpo36216): 'https://example.com\uFF03@bing.com' parses incompletely to host example.com, leaving '\uff03@bing.com' in the remainder. The normalized string parses to host example.com and fragment @bing.com. In both cases, the scheme is https. Since this works correctly, there is no CVE-2019-10160. Furthermore, it should be noted here that query is correctly marked as None, instead of an empty string.

Those are trivial examples and won't be able to *really* prove anything just yet, but at least they show this approach is worth looking into, and that it could have prevented issues before they became real.

Cheers,
Nam
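P.S. For background on why the CVE-2019-9636 input is dangerous in the first place (standard library only; this is not code from the proof of concept): U+FF03 FULLWIDTH NUMBER SIGN normalizes to '#' under NFKC, so a hostname check done before normalization and a request made after it can disagree about where the authority ends.

# Why 'https://example.com\uFF03@bing.com' is a problem: NFKC normalization
# turns U+FF03 (FULLWIDTH NUMBER SIGN) into '#', which ends the authority
# component. Pre-fix urlsplit took everything after the last '@' in the
# netloc as the host (bing.com); after normalization the host is example.com.
import unicodedata

url = 'https://example.com\uFF03@bing.com'
print(unicodedata.normalize('NFKC', url))
# -> https://example.com#@bing.com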
On 19 Jul 2019, at 02:12, Nam Nguyen <bitsink@gmail.com> wrote:
I'm working on the impact of CVE-2019-9636 as part of my day-job.
I'd be interested in your analysis of how your parsing proposal would have avoided this problem before it was described. I add the "before it was described" because I think you are claiming that the universal parser will prevent this class of security issue by the nature of its design.
That's exactly what I'd like the list to consider.
I claim that a parsing library would prevent this class of issues simply because it follows the specification to the letter.
Is the specification complete and unambiguous? Too often there are gaps in a specification that go unnoticed and lead to problems. Was the specification updated but the code to match it was not? While a parsing library may, or may not, make an implementation easier to write and review, there is still room for human error. Mistakes in understanding a specification are made by engineers and lead to problems that sometimes take years to discover. In short, I think this may reduce problems, but "prevent" them I do not believe.
As you'll find in my code, the grammar is a trivial translation of what is in the spec, one for one. Not only will this help explain what the code is doing, but it will also raise confidence that the code matches the spec.
I'm happy to have better tools and making review easier. Of course the parser needs to have a specification, documentation and trusted implementation to get the benefits.
Full disclosure, I fixed bpo30500, but I was not at all confident that the fix was correct. I understood what the code did; I just couldn't tell if that was the right thing to do.
I took a very quick look at bpo30500 and was struck by the comment that the code was working on a URL that had not been validated. Validation of the URL would reject the URL before the parsing happens in this case. Was that the case? If the code needs to run with input that is outside the preconditions of the specification, then you are in a place where you need to not only fix the code, but state that the code extends the spec in non-standard ways. Barry
On Sun, Jul 21, 2019 at 08:48:49AM +0100, Barry Scott wrote:
I took a very quick look at bpo30500 and was struck by the comment that the code was working on a URL that had not been validated.
Validation of the URL would reject the URL before the parsing happens in this case. Was that the case?
Sorry, can you elaborate on that? How do you validate a URL without attempting to parse it? You're surely not talking about looking it up in a whitelist are you? Thanks, -- Steven
On 21 Jul 2019, at 19:03, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Jul 21, 2019 at 08:48:49AM +0100, Barry Scott wrote:
I took a very quick look at bpo30500 and was struck by the comment that the code was working on a URL that had not been validated.
Validation of the URL would reject the URL before the parsing happens in this case. Was that the case?
Sorry, can you elaborate on that? How do you validate a URL without attempting to parse it? You're surely not talking about looking it up in a whitelist are you?
I was thinking about ensuring that the characters in the URL are from the subset that is allowed; \n is not allowed, for example. Yes, I agree you have to try to parse it. Barry
Thanks,
-- Steven
On Jul 21, 2019, at 14:13, Barry <barry@barrys-emacs.org> wrote:
On 21 Jul 2019, at 19:03, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Jul 21, 2019 at 08:48:49AM +0100, Barry Scott wrote:
I took a very quick look at bpo30500 and was struck by the comment that the code was working on a URL that had not been validated.
Validation of the URL would reject the URL before the parsing happens in this case. Was that the case?
Sorry, can you elaborate on that? How do you validate a URL without attempting to parse it? You're surely not talking about looking it up in a whitelist are you?
I was thinking about ensuring that the characters in the URL are from the subset that is allowed; \n is not allowed, for example. Yes, I agree you have to try to parse it.
For a spec that has different sets of restricted characters for different parts, that kind of prevalidation doesn’t seem like it would get you very far. At least a priori, if there are attacks that involve using illegal characters in the netloc or the path or the scheme or whatever, they could just as easily be characters that are legal elsewhere in the URL as characters that happen to not be legal anywhere. (If you’re just talking about mitigating one particular attack after it’s been discovered, that’s a different story. If checking for \n patches things without waiting for the library fix, obviously it’s worth doing.)
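As a hedged, stdlib-only illustration of why position-insensitive prevalidation doesn't get you far: a whole-URL character whitelist built from every character RFC 3986 allows anywhere in a URI still accepts the malformed URL discussed earlier in this thread, because '[' and ']' are legal in IPv6 host literals, just not where they appear there; only characters that are illegal everywhere, such as a newline, get caught.

# Illustrative only: a whole-URL character whitelist drawn from RFC 3986's
# unreserved, gen-delims, sub-delims and percent characters. It accepts the
# malformed URL discussed earlier (urlparse raises ValueError on it because
# the netloc looks like a broken IPv6 literal), and only catches characters
# such as '\n' that are illegal everywhere.
import re
import string

allowed = (string.ascii_letters + string.digits
           + "-._~"          # unreserved
           + ":/?#[]@"       # gen-delims
           + "!$&'()*+,;="   # sub-delims
           + "%")
is_plausible = re.compile('[' + re.escape(allowed) + ']*').fullmatch

print(bool(is_plausible('https://abc.d[ef.ghi/')))    # True -- misplaced '[' slips through
print(bool(is_plausible('https://example.com/\n')))   # False -- only the easy case is caught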
On Sun, Jul 21, 2019 at 2:50 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Jul 21, 2019, at 14:13, Barry <barry@barrys-emacs.org> wrote:
On 21 Jul 2019, at 19:03, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Jul 21, 2019 at 08:48:49AM +0100, Barry Scott wrote:
I took a very quick look at bpo30500 and was struck by the comment that the code was working on a URL that had not been validated.
The more worrisome comment is one from Victor Stinner, when he did not know whether a change was intentional or not: https://bugs.python.org/msg296422. That showed a real disconnect between code and spec. And a parsing library should be able to help with that, provided it is implemented correctly. So, I think we are in agreement in that regard.
Validation of the URL would reject the URL before the parsing happens in this case. Was that the case?
Sorry, can you elaborate on that? How do you validate a URL without attempting to parse it? You're surely not talking about looking it up in a whitelist are you?
I was thinking about ensuring that the characters in the URL are from the subset that is allowed; \n is not allowed, for example. Yes, I agree you have to try to parse it.
For a spec that has different sets of restricted characters for different parts, that kind of prevalidation doesn’t seem like it would get you very far. At least a priori, if there are attacks that involve using illegal characters in the netloc or the path or the scheme or whatever, they could just as easily be characters that are legal elsewhere in the URL as characters that happen to not be legal anywhere.
Yes, absolutely. For simple cases, I just don't see how string.split or re.match can be brushed aside. But for more complex cases, proper parsing will certainly help with validation. FYI, my current proof-of-concept parser is at ~300 lines of code, with debugging trace support. Other than performance (which I don't intend to tackle in my library very soon), is there any other concern that I have missed? At the moment, I am still of the opinion that the goal raised in this thread is very attainable, and should be considered. Cheers, Nam
On Jul 23, 2019, at 18:44, Nam Nguyen <bitsink@gmail.com> wrote:
FYI, my current proof of concept parser is at ~300 lines of code, with debugging trace support. Other than performance (which I don't intend to tackle in my library very soon), is there any other concern that I have missed? At the moment, I am still of the opinion that the goal raised in this thread is very attainable, and should be considered.
I personally don’t think anything more needs to be done for a proof of concept (except maybe proving that performance actually is solvable). But what’s the actual proposal here? Even if everyone agrees that this is a nifty idea, there’s nothing to accept until someone writes an acceptable-performance stdlib-ready parser library, the drop-in replacements for the critical bespoke parsing functions, and the tests that verify that they avoid known security problems but otherwise provide the same behavior.
On Tue, Jul 23, 2019 at 8:06 PM Andrew Barnert <abarnert@yahoo.com> wrote:
On Jul 23, 2019, at 18:44, Nam Nguyen <bitsink@gmail.com> wrote:
FYI, my current proof-of-concept parser is at ~300 lines of code, with debugging trace support. Other than performance (which I don't intend to tackle in my library very soon), is there any other concern that I have missed? At the moment, I am still of the opinion that the goal raised in this thread is very attainable, and should be considered.
I personally don’t think anything more needs to be done for a proof of concept (except maybe proving that performance actually is solvable).
But what’s the actual proposal here?
Back to my original requests to the list: 1) whether we want to have a (possibly private) parsing library in the stdlib, and 2) what features it should have. I have proposed that 1) yes, such a library would be useful, and 2) several requirements that such a library should fulfill. Are they acceptable? Apparently not. You first requested a debug trace, Barry wanted better performance, and Chris asked for proof that such a library would help. How about the other points I suggested? Do we need a full-blown universal parser? Is LL(1) enough, or do we need k / unlimited lookaheads? How about context-sensitive grammars? What performance budget can we spend on this vs. the status quo? Do we even care whether the parser is small, or that it comes from generated code? Et cetera... There are still plenty of open questions.

Even if everyone agrees that this is a nifty idea, there's nothing to accept until someone writes an acceptable-performance stdlib-ready parser library, the drop-in replacements for the critical bespoke parsing functions, and the tests that verify that they avoid known security problems but otherwise provide the same behavior.
These are good points to set as targets! What does it take for me to get the list to agree on one such set of criteria? When we get past that, we can look for solutions to fulfill them. For example, Guido suggested pyparsing. It is indeed faster than my library, but it does not support context-sensitive or ambiguous grammars, a requirement that *I* think should be satisfied. To recap, the "proposal" here isn't one specific solution, but an initial set of requirements that a solution must fulfill. If you think a parser is useful, let's debate what form it should take. If you don't think so, let's hear your reasoning too. Thanks! Nam
On 7/24/2019 9:15 PM, Nam Nguyen wrote:
Back to my original requests to the list: 1) Whether we want to have a (possibly private) parsing library in the stdlib, and 2) What features it should have. I have proposed that 1) yes, such a library would be useful, and 2) several requirements that such a library should fulfill.
Are they acceptable? Apparently not. As you first requested for debug trace, Barry wanted better performance, and Chris asked for proof such library would help. How about the other points I suggested? Do we need a full-blown universal parser? Is LL(1) enough, or do we need k / unlimited lookaheads? How about context sensitive grammars? What performance budget can we spend on this vs the status quo? Do we even care if the parser is small, or that it comes from generated code? Et cetera... There are still plenty of open questions.
I would think deciding on LL(1), or a specific k lookahead, or any other parser feature would depend on which problems you're trying to solve.

I think you're looking at this backwards. You seem to be saying "here's a parser, now solve problems in the stdlib with it". For something in the stdlib, I think it has to be "here are the problems to be solved, now I'll design a parser to solve them". To take but one example: what if it turns out that a URL can't be parsed with an LL(1) parser, and you need more horsepower? Don't you think you'd need to know that in advance of proposing an LL(1) parser for the stdlib?

And herein lies an issue: you can't design a parser that solves every issue that the stdlib will ever have. At best you can only solve the existing problems. But if you do that, what do you do if we want to add something new to the stdlib that your parser doesn't have enough power to parse?

To try and be more concretely helpful: at the very least, I think you should propose a specific set of things in the existing stdlib that would benefit from a parser generator, instead of the existing ad hoc parsers being used. Bonus points for showing how the code is simplified and/or made more secure by using a parser generator.

Eric
On Wed, Jul 24, 2019 at 6:40 PM Eric V. Smith <eric@trueblade.com> wrote:
On 7/24/2019 9:15 PM, Nam Nguyen wrote:
Back to my original requests to the list: 1) Whether we want to have a (possibly private) parsing library in the stdlib, and 2) What features it should have. I have proposed that 1) yes, such a library would be useful, and 2) several requirements that such a library should fulfill.
Are they acceptable? Apparently not. As you first requested for debug trace, Barry wanted better performance, and Chris asked for proof such library would help. How about the other points I suggested? Do we need a full-blown universal parser? Is LL(1) enough, or do we need k / unlimited lookaheads? How about context sensitive grammars? What performance budget can we spend on this vs the status quo? Do we even care if the parser is small, or that it comes from generated code? Et cetera... There are still plenty of open questions.
I would think deciding on LL(1), or a specific k lookahead, or any other parser feature would depend on which problems you're trying to solve.
I think you're looking at this backwards. You seem to be saying "here's a parser, now solve problems in the stdlib with it".
Yes. I started out with the assumption that a parser library can certainly help with code in the stdlib, because I have seen, or got to know of, problems (though not *all* of them) in the stdlib that such a library could solve. This is where collective wisdom could help.
For something in the stdlib, I think it has to be "here are the problems to be solved, now I'll design a parser to solve them".
I gave links to CVEs and bugs in BPO. Those are illustrative of problems to be solved. If you know of other places where a parser can help, let's hear it.
To take but one example: what if it turns out that a URL can't be parsed with a LL(1) parser, and you need more horsepower? Don't you think you'd need to know that in advance of proposing a LL(1) parser for the stdlib?
And herein lies an issue: you can't design a parser that solves every issue that the stdlib will ever have.
I'm not looking for, nor dreaming of, one such parser ;). But with educated guesses, we could have a set of requirements that holds for some reasonable time. For example, we expect to deal mostly with context-free grammars, so anything that can help parse all of them is an acceptable solution.
At best you can only solve the existing problems. But if you do that, what do you do if we want to add something new to the stdlib that your parser doesn't have enough power to parse?
Accept that fact, and work on solutions then. I mean, no one can wholly foresee the future. If a need arises that isn't accommodated by existing tools, well, we invent new tools. I don't see that as an impediment to making improvements now.
To try and be more concretely helpful: at the very least, I think you should propose a specific set of things in the existing stdlib that would benefit from a parser generator, instead of the existing ad hoc parsers being used. Bonus points for showing how the code is simplified and/or made more secure by using a parser generator.
The major part of this thread was about how my PoC could help with urlparse. I'm sorry that the snipping has cut off context, but you'll be able to find the previous exchanges in the archive. Cheers, Nam
On Thu, Jul 25, 2019 at 12:11 PM Nam Nguyen <bitsink@gmail.com> wrote:
On Wed, Jul 24, 2019 at 6:40 PM Eric V. Smith <eric@trueblade.com> wrote:
For something in the stdlib, I think it has to be "here are the problems to be solved, now I'll design a parser to solve them".
I gave links to CVEs and bugs in BPO. Those are illustrative of problems to be solved. If you know of other places where a parser can help, let's hear it.
But they are *individual* problems to be solved, and you're trying to come up with a *general* solution. So you'll need to be very clear about exactly what you're solving, particularly since you're implying that the use of such a parser will prevent these sorts of issues from happening in the future. ChrisA
On Thu, 25 Jul 2019 at 02:16, Nam Nguyen <bitsink@gmail.com> wrote:
Back to my original requests to the list: 1) Whether we want to have a (possibly private) parsing library in the stdlib
In the abstract, no. Propose a specific library, and that answer would change to "maybe".
and 2) What features it should have.
That question only makes sense if you get agreement to the abstract proposal that "we should add a parsing library". And as I said, I don't agree to that, so I can't answer the second question. Generally, things go into the stdlib when they have been developed externally and proved their value. The bar for designing a whole library from scratch, "specifically" targeted at stdlib inclusion, is very high, and you're nowhere near reaching it IMO.
These are good points to set as targets! What does it take for me to get the list to agree on one such set of criteria?
You need to start by getting agreement on the premise that adding a newly-written parser to the stdlib is a good idea. And so far your *only* argument seems to be that "it will avoid a class of security bugs", which I find extremely unconvincing (and I get the impression others do, too). But even if "using a real parser" was useful in that context, there's *still* no argument for writing one from scratch, rather than using an existing, proven library. At the most basic level, what if there's a bug in your new parsing library? If we're using it in security-critical code, such a bug would be a vulnerability just like the ones you're suggesting your parser would avoid. Are you asking us to believe that your code will be robust enough to trust over code that's been used in production systems for years?

I think you need to stop getting distracted by details, and focus on your stated initial request: "Whether we want to have a (possibly private) parsing library in the stdlib". You don't seem to me to have persuaded anyone of this basic suggestion yet, so asking what the library that no-one has agreed to should look like is unlikely to produce useful feedback.

You could of course *change* your main proposal, to something like "should we add parsing library X from PyPI to the stdlib", or "should we rewrite the existing URL parsing code in the stdlib from scratch to make it less likely to have security bugs". I doubt either of those would get much support either, but I could be proved wrong. The key here is to focus on a clear proposal, and not get distracted by implementation details until the principle gets enough support.

Paul
On Thu, Jul 25, 2019 at 2:32 AM Paul Moore <p.f.moore@gmail.com> wrote:
On Thu, 25 Jul 2019 at 02:16, Nam Nguyen <bitsink@gmail.com> wrote:
Back to my original requests to the list: 1) Whether we want to have a (possibly private) parsing library in the stdlib
In the abstract, no. Propose a specific library, and that answer would change to "maybe".
I have no specific library to propose. I'm looking for a list of features such a library should have.
and 2) What features it should have.
That question only makes sense if you get agreement to the abstract proposal that "we should add a parsing library. And as I said, I don't agree to that so I can't answer the second question.
As Chris summarized it correctly, I am advocating for a general solution to individual problems (which have the same nature). We can certainly solve the problems when they are reported, or we can take a proactive approach to make them less likely to occur. I am talking about a class of input validation issues here, and I thought parsing would be a very natural solution to that. This is quite similar to a context-sensitive templating library that prevents cross-site scripting on the output side. So I don't know why it's hard (or what it takes) to convince people that it's a good thing(tm).
Generally, things go into the stdlib when they have been developed externally and proved their value. The bar for designing a whole library from scratch, "specifically" targeted at stdlib inclusion, is very high, and you're nowhere near reaching it IMO.
This is a misunderstanding. I have not proposed any from-scratch or existing library to be used. And on this note, please allow me to make it clear one more time that I am not asking for a publicly facing library either.
These are good points to set as targets! What does it take for me to get the list to agree on one such set of criteria?
You need to start by getting agreement on the premise that adding a newly-written parser to the stdlib is a good idea. And so far your *only* argument seems to be that "it will avoid a class of security bugs" which I find extremely unconvincing (and I get the impression others do, too).
Why? What is unconvincing about a parsing library being able to... parse (and therefore validate) inputs?
But even if "using a real parser" was useful in that context, there's *still* no argument for writing one from scratch, rather than using an existing, proven library.
Never a goal.
At the most basic level, what if there's a bug in your new parsing library? If we're using it in security-critical code, such a bug would be a vulnerability just like the ones you're suggesting your parser would avoid. Are you asking us to believe that your code will be robust enough to trust over code that's been used in production systems for years?
I think you need to stop getting distracted by details, and focus on your stated initial request "Whether we want to have a (possibly private) parsing library in the stdlib". You don't seem to me to have persuaded anyone of this basic suggestion yet,
Good observation. How do I convince you that complex input validation tasks should be left to a parser? Thanks! Nam
On Thu, 25 Jul 2019 at 15:53, Nam Nguyen <bitsink@gmail.com> wrote:
You need to start by getting agreement on the premise that adding a newly-written parser to the stdlib is a good idea. And so far your *only* argument seems to be that "it will avoid a class of security bugs" which I find extremely unconvincing (and I get the impression others do, too).
Why? What is unconvincing about a parsing library being able... parse (and therefore, validate) inputs?
Nothing. But why replace existing, working code with brand new, untried code? And in particular in security-sensitive areas? We have parsing code; it was just written by hand rather than with a library (which likely didn't exist at the time). But it's had literally years of testing in real-world situations, so any problems from having written it by hand have probably been dealt with by now. Let me turn your question back on you: what is difficult to understand about the idea that replacing existing, working code is a risk, and needs to be justified?
I think you need to stop getting distracted by details, and focus on your stated initial request "Whether we want to have a (possibly private) parsing library in the stdlib". You don't seem to me to have persuaded anyone of this basic suggestion yet,
Good observation. How do I convince you that complex input validation tasks should be left to a parser?
You don't have to - I'm completely convinced of that fact. How do *I* convince you that replacing existing, working code needs a better justification than "we should have done it this way originally"? Paul
Nam, I think it'd be better to frame the proposal as a security enhancement. State some of the common bugs/gotchas found when manually implementing parsers, and the impact this has had on Python over the years. Seeing a full list of security issues (CVEs) by module would give us a sense of how widespread the problem is.

Then survey the stdlib for what kinds of grammars are currently being parsed and what ad-hoc parsing strategies are implemented, and provide examples of whether having a general-purpose parser would have prevented the security issues you have previously cited. Right now, it is not clear what the impact of such a refactor would be, nor the worth of such an attempt.

What others have said earlier is that you are the one that needs to provide some of the requirements for the proposed private parsing library. And from what I read of your emails you do have some ideas. For example, you want it to be easy to write and review (I guess here you would eventually like it to be a close translation from whatever is specified in the RFC or grammar specification). But you also need to take into consideration some of the list's concerns: the parser library has to be performant, as a performance regression is likely not to be tolerable.

On Thu, Jul 25, 2019 at 10:54 AM Nam Nguyen <bitsink@gmail.com> wrote:
On Thu, Jul 25, 2019 at 2:32 AM Paul Moore <p.f.moore@gmail.com> wrote:
On Thu, 25 Jul 2019 at 02:16, Nam Nguyen <bitsink@gmail.com> wrote:
Back to my original requests to the list: 1) Whether we want to have a (possibly private) parsing library in the stdlib
In the abstract, no. Propose a specific library, and that answer would change to "maybe".
I have no specific library to propose. I'm looking for a list of features such a library should have.
and 2) What features it should have.
That question only makes sense if you get agreement to the abstract proposal that "we should add a parsing library. And as I said, I don't agree to that so I can't answer the second question.
As Chris summarized it correctly, I am advocating for a general solution to individual problems (which have the same nature). We can certainly solve the problems when they are reported, or we can take a proactive approach to make them less likely to occur. I am talking about a class of input validation issues here and I thought parsing would be a very natural solution to that. This is quite similar to a context-sensitive templating library that prevents cross-site-scripting on the output side. So I don't know why (or what it takes) to convince people that it's a good thing(tm).
Generally, things go into the stdlib when they have been developed externally and proved their value. The bar for designing a whole library from scratch, "specifically" targeted at stdlib inclusion, is very high, and you're nowhere near reaching it IMO.
This is a misunderstanding. I have not proposed any from-scratch, or existing library to be used. And on this note, please allow me to make it clear once more time that I am not asking for a publicly-facing library either.
These are good points to set as targets! What does it take for me to get the list to agree on one such set of criteria?
You need to start by getting agreement on the premise that adding a newly-written parser to the stdlib is a good idea. And so far your *only* argument seems to be that "it will avoid a class of security bugs" which I find extremely unconvincing (and I get the impression others do, too).
Why? What is unconvincing about a parsing library being able... parse (and therefore, validate) inputs?
But even if "using a real parser" was useful in that context, there's *still* no argument for writing one from scratch, rather than using an existing, proven library.
Never a goal.
At the most basic level, what if there's a bug in your new parsing library? If we're using it in security-critical code, such a bug would be a vulnerability just like the ones you're suggesting your parser would avoid. Are you asking us to believe that your code will be robust enough to trust over code that's been used in production systems for years?
I think you need to stop getting distracted by details, and focus on your stated initial request "Whether we want to have a (possibly private) parsing library in the stdlib". You don't seem to me to have persuaded anyone of this basic suggestion yet,
Good observation. How do I convince you that complex input validation tasks should be left to a parser?
Thanks! Nam
-- Sebastian Kreft
On Tue, Jul 16, 2019 at 09:17:07PM -0700, Nam Nguyen wrote:
Generally speaking, though, do you see 1 millisecond spent on parsing a URL deal breaker?
That seems pretty slow for today's PCs, and quite a regression compared to the existing urllib.parse.urlparse function. On my old PC, I can parse a million URLs in 11 seconds, call it ten microseconds each:

py> from timeit import Timer
py> setup = "from urllib.parse import urlparse; s = 'https://www.google.com.au/foo/bar/web.html?a=b&c=d'"
py> t = Timer("urlparse(s)", setup=setup)
py> t.repeat()
[11.07607395760715, 11.11838942207396, 11.057457006536424]

If you run the same timeit benchmark, you'll get a rough idea of the relative speed of your computer compared to mine; that will allow you to extrapolate your "1 millisecond" per URL onto my PC. -- Steven
On Mon, Jul 15, 2019 at 8:47 PM Andrew Barnert <abarnert@yahoo.com> wrote:
On Jul 15, 2019, at 18:44, Nam Nguyen <bitsink@gmail.com> wrote:
I have implemented a tiny (~200 SLOCs) package at https://gitlab.com/nam-nguyen/parser_compynator that demonstrates something like this is possible. There are several examples for you to have a feel of it, as well as some early benchmark numbers to consider. This is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope that it would convert the skeptics ;).
For at least some of your use cases, I don’t think it’s a problem that it’s 70x slower than the custom parsers you’d be replacing. How often do you need to parse a million URLs in your inner loop? Also, if the function composition is really the performance hurdle, can you optimize that away relatively simply, just by building an explicit tree (expression-template style) and walking the tree in a __call__ method, rather than building an implicit tree of nested calls? (And that could be optimized further if needed, e.g. by turning the tree walk into a simple virtual machine where all of the fundamental operations are inlined into the loop, and maybe even accelerating that with C code.)
But I do think it’s a problem that there seems to be no way to usefully indicate failure to the caller, and I’m not sure that could be fixed as easily.
An empty set signifies the parse has failed. Perhaps I have misunderstood what you indicated here.
Invalid inputs in your readme examples don’t fail, they successfully return an empty set.
Because the library supports ambiguity, it can return more than one parse result. The guarantee here is that if it returns an empty set, the parse has failed.
There also doesn’t seem to be any way to trigger a hard fail rather than a backtrack.
You can have a parser that raises an exception. None of the primitive parsers do that though.
So I’m not sure how a real urlparse replacement could do the things the current one does, like raising a ValueError on https://abc.d[ef.ghi/ complaining that the netloc looks like an invalid IPv6 address. (Maybe you could def a function that raises a ValueError and attach it as a where somewhere in the parser tree? But even if that works, wouldn’t you get a meaningless exception that doesn’t have any information about where in the source text or where in the parse tree it came from or why it was raised, and, as your readme says, a stack trace full of garbage?)
urlparse right now raises ValueError('Invalid IPv6 URL'). It does not mention where in the source text the error comes from.
Can you add failure handling without breaking the “~200LOC and easy to read” feature of the library, and without breaking the “easy to read once you grok parser combinators” feature of the parsers built with it?
This is a good request. I will have to play around with this idea more. What I think could be the most challenging task is to attribute failure to the appropriate rule(s) (e.g. expr expects term + term, but you only have term +). I feel like some metadata about the grammar might be required here, and that might be too unwieldy to provide in a parser combinator formulation. Interestingly enough, regex doesn't have anything like this either. Cheers, Nam
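P.S. One hedged sketch of how combinator libraries commonly report failures without much extra machinery (the names below are illustrative, not from my package): thread the input position through the parsers and remember the furthest position any parser reached, together with what it expected there.

# "Farthest failure" reporting: a shared Failure object records the furthest
# offset any parser got to and the expectations that failed there.

class Failure:
    def __init__(self):
        self.pos = -1
        self.expected = set()

    def note(self, pos, expectation):
        if pos > self.pos:
            self.pos, self.expected = pos, {expectation}
        elif pos == self.pos:
            self.expected.add(expectation)

def literal(text):
    def parse(s, pos, failure):
        if s.startswith(text, pos):
            return {(text, pos + len(text))}
        failure.note(pos, repr(text))
        return set()
    return parse

def seq(p, q):
    def parse(s, pos, failure):
        return {((v1, v2), pos2)
                for v1, pos1 in p(s, pos, failure)
                for v2, pos2 in q(s, pos1, failure)}
    return parse

# The "term + term" example from above, fed an incomplete input.
expr = seq(literal('term'), seq(literal('+'), literal('term')))
failure = Failure()
print(expr('term+', 0, failure))       # set() -- the parse failed
print(failure.pos, failure.expected)   # 5 {"'term'"} -- failed at offset 5 expecting 'term'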
On Jul 16, 2019, at 21:30, Nam Nguyen <bitsink@gmail.com> wrote:
Can you add failure handling without breaking the “~200LOC and easy to read” feature of the library, and without breaking the “easy to read once you grok parser combinators” feature of the parsers built with it? This is a good request. I will have to play around with this idea more. What I think could be the most challenging task is to attribute failure to appropriate rule(s) (i.e. expr expects a term + term, but you only have term +). I feel like some metadata about the grammar might be required here, and that might be too unwieldy to provide in a parser combinator formulation.
For what it’s worth, the only time I played with parser combinators in anger, something like 90% of the final code was for tracking the source position and tree path and other state. I’m hoping you can come up with something more clever than we did, so your 200 lines of pretty code doesn’t turn into 2000 lines of ugly spaghetti.
Interestingly enough, regex doesn't have anything like this either.
Sure, but regex is meant to be dumb, restricted, and fast. Something that’s meant to parse arbitrary, arbitrarily-structured languages needs more failure handling. Also, regex does have (some) handling for errors in the regex, as opposed to parse failures, and I think your library will probably need at least that much. And, without the separate “compile” and “eval” stages that most regex libraries have, I don’t think you can really separate errors from failure the same way, so you kind of need failure handling just for debugging parsers. Of course I may be wrong in thinking that this is the more important limitation, rather than performance. But it’s also the more fun limitation to solve, right? :)
On 17 Jul 2019, at 06:30, Nam Nguyen <bitsink@gmail.com> wrote:
On Mon, Jul 15, 2019 at 8:47 PM Andrew Barnert <abarnert@yahoo.com> wrote: On Jul 15, 2019, at 18:44, Nam Nguyen <bitsink@gmail.com> wrote:
I have implemented a tiny (~200 SLOCs) package at https://gitlab.com/nam-nguyen/parser_compynator that demonstrates something like this is possible. There are several examples for you to have a feel of it, as well as some early benchmark numbers to consider. This is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope that it would convert the skeptics ;).
For at least some of your use cases, I don’t think it’s a problem that it’s 70x slower than the custom parsers you’d be replacing. How often do you need to parse a million URLs in your inner loop? Also, if the function composition is really the performance hurdle, can you optimize that away relatively simply, just by building an explicit tree (expression-template style) and walking the tree in a __call__ method, rather than building an implicit tree of nested calls? (And that could be optimized further if needed, e.g. by turning the tree walk into a simple virtual machine where all of the fundamental operations are inlined into the loop, and maybe even accelerating that with C code.)
But I do think it’s a problem that there seems to be no way to usefully indicate failure to the caller, and I’m not sure that could be fixed as easily. An empty set signifies the parse has failed. Perhaps I have misunderstood what you indicated here.
This is pretty terrible. The empty set shouldn't be the only feedback for errors. It reminds me of C's atoi, which converts a string to an integer and uses 0 to indicate failure. But zero is of course a valid number (and a super common one at that!). Error messages in parsing aren't a nice-to-have; they're absolutely critical. / Anders
Have you looked into pyparsing (https://github.com/pyparsing/pyparsing)? It somehow looks relevant. On Mon, Jul 15, 2019 at 6:45 PM Nam Nguyen <bitsink@gmail.com> wrote:
Hello list,
I sent an email to this list two or three months ago about the same idea. In that discussion, there were both skepticism and support. Since I had some time during the previous long weekend, I have made my idea more concrete and I thought I would try with the list again, after having run it through some of you privately.
GOAL: To have some parsing primitives in the stdlib so that other modules in the stdlib itself can make use of. This would alleviate various security issues we have seen throughout the years.
With that goal in mind, I opine that any parsing library for this purpose should have the following characteristics:
#. Can be expressed in code. My opinion is that it is hard to review generated code. Code review is even more crucial in security contexts.
#. Small and verifiable. This helps build trust in the code that is meant to plug security holes.
#. Less evolving. Being in the stdlib has its drawback that is development velocity. The library should be theoretically sound and stable from the beginning.
#. Universal. Most of the times we'll parse left-factored context-free grammars, but sometimes we'll also want to parse context-sensitive grammars such as short XML fragments in which end tags must match start tags.
I have implemented a tiny (~200 SLOCs) package at https://gitlab.com/nam-nguyen/parser_compynator that demonstrates something like this is possible. There are several examples for you to have a feel of it, as well as some early benchmark numbers to consider. This is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope that it would convert the skeptics ;).
Finally, my request to the list is: Please debate on: 1) whether we want a small (even private, underscore prefixed) parsing library in the stdlib to help with tasks that are a little too complex for regexes, and 2) if yes, how should it look like?
I also welcome comments (naming, uses of operator overloading, features, bikeshedding, etc.) on the above package ;).
Thanks! Nam
--Guido van Rossum (python.org/~guido) *Pronouns: he/him/his*
Yes, I have. PyParsing was the first one I turned to, for it has been available for a very long time. I emailed the author, Paul McGuire, a few times about this python-ideas thread too, but never got a response. On Fri, Jul 19, 2019 at 9:36 AM Guido van Rossum <guido@python.org> wrote:
But regardless of the author's availability, do you think it would serve your purpose? How does it compare to your own library? Maybe you could compare the timings?
--Guido van Rossum (python.org/~guido)
On Fri, Jul 19, 2019 at 8:01 PM Guido van Rossum <guido@python.org> wrote:
But regardless of the author's availability, do you think it would serve your purpose?
Yes, it would serve the goal here, though I would love to see a much-reduced feature set and support for context-sensitive grammars.
How does it compare to your own library?
First of all, pyparsing has been around for 15+ years; it is often *the* parsing library Python programmers turn to. My library has enjoyed life for about 2 weeks. Secondly, on the feature set deemed useful here, pyparsing has some failure/exception reporting and source-location support (a short sketch follows below). For example:

pyparsing.ParseException: Expected {{["-"]... Re:('[+-]?\\d+(?:\\.\\d*)?(?:[eE][+-]?\\d+)?')} | Group:({Suppress:("(") Forward: ... Suppress:(")")})} (at char 1), (line:1, col:2)

Pyparsing also ignores whitespace by default. These features are not available in my library. Finally, pyparsing has some limitations of its own, among them an inability to handle left-recursive grammars, ambiguity, and context-sensitive grammars.

Maybe you could compare the timings?
I've just pushed out https://gitlab.com/nam-nguyen/parser_compynator/commit/40d41e6acc61f72184726... to do that. Pyparsing seems to be about 2.5 times faster than my library for a simple +-*/ grammar. The commit should give you some feel for how similar the two are. Cheers, Nam
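A minimal sketch (mine, not from the thread) of the kind of failure and source-location reporting pyparsing gives for a small +-*/ grammar; the grammar and the exact message here are illustrative and differ from the one pasted above:

import pyparsing as pp

# Arithmetic expressions over numbers with the usual precedence.
number = pp.pyparsing_common.number
expr = pp.infixNotation(number, [
    (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
    (pp.oneOf("+ -"), 2, pp.opAssoc.LEFT),
])

try:
    expr.parseString("1 + * 2", parseAll=True)
except pp.ParseException as err:
    print(err.line)                      # the offending input line
    print(" " * (err.column - 1) + "^")  # caret under the failure position
    print(err)                           # message ends with (at char N), (line:L, col:C)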
The documentation is beautiful. One of the features I was looking for when evaluating parsers was the ability to run expressions on the rules. For example, if you match "\begin{\w*}" and the \w* turns out to be "enumeration", then later, when you match "\end{\w*}", you want to check that that \w* is also "enumeration", or else raise an error. A similar thing happens when you're trying to parse Python source code and track the indentation level: you might want to check that the next indentation level is the same or corresponds to a dedent. Expressions on rules should be able to control whether something matches, store values in the parse tree, store values that can be read by other expressions, and raise parsing errors.

The beauty of expressions is that you can do the parsing and build the AST in one shot. If you've ever looked at the Python source code, it is unfortunate that those tasks have to be done separately, even though most changes to the AST require parsing changes. The most modern parsing algorithms have this; the old parsing libraries (lex/yacc, flex/bison, ANTLR) were very limited. Also, I'm not sure the separation between tokenization and parsing is necessary if you're not worried about efficiency.

Best, Neil
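A small, standalone parser-combinator sketch of the context-sensitive \begin{...}/\end{...} check described above. This is hand-rolled code for illustration only; it is not the API of pyparsing or of the parser_compynator package, and all names are made up. The key move is that the second parser is built from the value produced by the first, so the end tag is required to repeat the begin tag:

import re

def regex(pattern):
    # Parser that matches `pattern` at the current position; yields (matched_text, new_pos).
    compiled = re.compile(pattern)
    def parse(text, pos):
        m = compiled.match(text, pos)
        return [(m.group(0), m.end())] if m else []
    return parse

def literal(s):
    return regex(re.escape(s))

def seq(*parsers):
    # Run parsers in order, collecting their values in a tuple.
    def parse(text, pos):
        results = [((), pos)]
        for p in parsers:
            results = [(acc + (v,), nxt) for acc, cur in results for v, nxt in p(text, cur)]
        return results
    return parse

def then(p, make_next):
    # Sequence p with a parser *built from* p's value; depending on an earlier
    # result is what lets the grammar be context sensitive.
    def parse(text, pos):
        return [r for value, nxt in p(text, pos) for r in make_next(value)(text, nxt)]
    return parse

# \begin{<name>} <body> \end{<name>}, where the end tag must repeat the begin tag.
environment = then(
    seq(literal(r"\begin{"), regex(r"\w*"), literal("}")),
    lambda begin: seq(regex(r"[^\\]*"), literal(r"\end{" + begin[1] + "}")),
)

print(environment(r"\begin{enumeration}text\end{enumeration}", 0))  # one successful parse
print(environment(r"\begin{enumeration}text\end{itemize}", 0))      # [] -- no parse, tags differ

A "raise a parsing error" variant could wrap a parser and raise ValueError when it produces no results, which is roughly the hard-failure behaviour a stdlib caller would want instead of a silent empty set.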
participants (12)
- Anders Hovmöller
- Andrew Barnert
- Barry
- Barry Scott
- Chris Angelico
- Eric V. Smith
- Guido van Rossum
- Nam Nguyen
- Neil Girdhar
- Paul Moore
- Sebastian Kreft
- Steven D'Aprano