On Thu, Jul 18, 2019 at 12:07 AM Barry Scott <barry@barrys-emacs.org> wrote:

On 18 Jul 2019, at 05:23, Nam Nguyen <bitsink@gmail.com> wrote:

On Wed, Jul 17, 2019 at 12:38 AM Barry Scott <barry@barrys-emacs.org> wrote:
But if your use cases call for performance, it is perfectly fine to understand the tradeoffs, and opt in to the more appropriate solutions. And, of course, maybe there is a solution that could satisfy *both*.

Generally speaking, though, do you see 1 millisecond spent on parsing a URL deal breaker? I sense that some web frameworks might not like that very much, but I don't have any concrete use case to quote.

Yes 1ms would be a serious issue.

I guess what I'm concerned about is the use of a universal parser for a benefit I'm not clear exists having a terrible affect of the speed of code that is secure and correct.

I hope you are doubting that *my* library has not proven itself yet, rather than the benefit if a proper parsing package. If it was about the second, perhaps these links could convince you that URL parsing was simply not "secure and correct":

https://www.cvedetails.com/cve/CVE-2019-10160/ urlsplit and urlparse regress from the fix below.
https://www.cvedetails.com/cve/CVE-2019-9636/ urlsplit and urlparse vs unicode normalization.
https://bugs.python.org/issue30500 urlparse's handling of # in password

That is a correlation, as Chris said, show causation.

There has been no proof so far. And this is actually good discussion because it now focuses on how a parsing library might help, not particular about my lousy implementation ;).

I'm working on the impact of CVE-2019-9636 as part of my day-job.

I'd be interesting in your analysis of how you parsing proposal would have avoided this problem before it was described.
I add the "before it was described" because I think you are claiming that the universal parser will prevent this class of
security issue by the nature of its design.

That's exactly what I'd like the list to consider. I claim that a parsing library would prevent this class of issues simply because it follows the specification to the letter. As you'll find in my code, the grammar is trivial translation of what is in the spec, one for one. Not only will this help explain what the code is doing, but also raise the confidence that the code matches the spec. Full disclosure, I fixed bpo30500, but I was not at all confident that the fix was correct. I understood what the code did; I just couldn't tell if that was the right thing to do.

And many more related to HTTP header, cookie, email... These things are tricky.

Parsing of HTTP headers is not that hard. Have a look at the code in twisted library that handles all this pretty well.
Where HTTP gets complex is the semantic meaning of the headers. Where performance matters is when handling
very large HTTP messages. We see 1GiB POST's to backup servers that contain 1 million small files mime
encoded in them, and python can handle that well enough. But not with a x70 slow down.

When performance matters, there are alternatives. For example, XML parsing can be done with elementtree. And someone smarter than me can implement a much faster parsing library to accommodate both security and performance.

It is fair to use the numbers I currently have (~900 usec vs 13 usec per parse) to ballpark the impact but I'm pretty confident that performance will sort itself out acceptably, eventually (and even with an entirely different library).

If 1ms is a deal breaker, what is a more palatable latency?

You need to be very close to the 13us mark. For correctness and security I will take the hit of the patch to CVE-2019-9636, which will be a few microseconds at most.

I have to admit there's a long way before my package could get there, or ever. But I won't be surprise when some alternative parsing libraries can deliver that.

How much latency would you trade for security?

Please show your approach can deliver that extra security.

I am using same inputs in those references, and not changing anything in the parser:

bpo30500:  '' is parsed as scheme http, host, fragment @evil.com, as expected. Path is an empty string, query is None.

CVE-2019-9636 (aka bpo36216): 'https://example.com\uFF03@bing.com' parses incompletely to host example.com, leaves '\uff03@bing.com' in remain. The normalized string parses to host example.com, and fragment @bing.com. In both cases, the scheme is https. Since this works correctly, there is no CVE-2019-10160.

Furthermore, it should be noted here that query is correctly marked as None, instead of empty string.

Those are trivial examples and won't be able to *really* prove anything just yet, but at least they show this approach is worth looking into, that it could have prevented issues before they become real.