[Python-ideas] Re: Universal parsing library in the stdlib to alleviate security issues

July 18, 2019

      On Thu, Jul 18, 2019 at 12:07 AM Barry Scott <barry@barrys-emacs.org> wrote:
...
On 18 Jul 2019, at 05:23, Nam Nguyen <bitsink@gmail.com> wrote:
On Wed, Jul 17, 2019 at 12:38 AM Barry Scott <barry@barrys-emacs.org>
wrote:
...
But if your use cases call for performance, it is perfectly fine to
understand the tradeoffs, and opt in to the more appropriate solutions.
And, of course, maybe there is a solution that could satisfy *both*.
Generally speaking, though, do you see 1 millisecond spent on parsing a
URL deal breaker? I sense that some web frameworks might not like that very
much, but I don't have any concrete use case to quote.
Yes 1ms would be a serious issue.
I guess what I'm concerned about is the use of a universal parser for a
benefit I'm not clear exists having a terrible affect of the speed of code
that is secure and correct.
I hope you are doubting that *my* library has not proven itself yet,
rather than the benefit if a proper parsing package. If it was about the
second, perhaps these links could convince you that URL parsing was simply
not "secure and correct":
https://www.cvedetails.com/cve/CVE-2019-10160/ urlsplit and urlparse
regress from the fix below.
https://www.cvedetails.com/cve/CVE-2019-9636/ urlsplit and urlparse vs
unicode normalization.
https://bugs.python.org/issue30500 urlparse's handling of # in password
That is a correlation, as Chris said, show causation.
There has been no proof so far. And this is actually good discussion
because it now focuses on how a parsing library might help, not particular
about my lousy implementation ;).
...
I'm working on the impact of CVE-2019-9636 as part of my day-job.
I'd be interesting in your analysis of how you parsing proposal would have
avoided this problem before it was described.
I add the "before it was described" because I think you are claiming that
the universal parser will prevent this class of
security issue by the nature of its design.
That's exactly what I'd like the list to consider. I claim that a parsing
library would prevent this class of issues simply because it follows the
specification to the letter. As you'll find in my code, the grammar is
trivial translation of what is in the spec, one for one. Not only will this
help explain what the code is doing, but also raise the confidence that the
code matches the spec. Full disclosure, I fixed bpo30500, but I was not at
all confident that the fix was correct. I understood what the code did; I
just couldn't tell if that was the right thing to do.
...
And many more related to HTTP header, cookie, email... These things are
tricky.
Parsing of HTTP headers is not that hard. Have a look at the code in
twisted library that handles all this pretty well.
Where HTTP gets complex is the semantic meaning of the headers. Where
performance matters is when handling
very large HTTP messages. We see 1GiB POST's to backup servers that
contain 1 million small files mime
encoded in them, and python can handle that well enough. But not with a
x70 slow down.
When performance matters, there are alternatives. For example, XML parsing
can be done with elementtree. And someone smarter than me can implement a
much faster parsing library to accommodate both security and performance.
...
It is fair to use the numbers I currently have (~900 usec vs 13 usec per
parse) to ballpark the impact but I'm pretty confident that performance
will sort itself out acceptably, eventually (and even with an entirely
different library).
If 1ms is a deal breaker, what is a more palatable latency?
You need to be very close to the 13us mark. For correctness and security I
will take the hit of the patch to CVE-2019-9636, which will be a few
microseconds at most.
I have to admit there's a long way before my package could get there, or
ever. But I won't be surprise when some alternative parsing libraries can
deliver that.
...
How much latency would you trade for security?
Please show your approach can deliver that extra security.
I am using same inputs in those references, and not changing anything in
the parser:

bpo30500:  'http://127.0.0.1#@evil.com/' is parsed as scheme http, host
127.0.0.1, fragment @evil.com, as expected. Path is an empty string, query
is None.

CVE-2019-9636 (aka bpo36216): 'https://example.com\uFF03@bing.com' parses
incompletely to host example.com, leaves '\uff03@bing.com' in remain. The
normalized string parses to host example.com, and fragment @bing.com. In
both cases, the scheme is https. Since this works correctly, there is no
CVE-2019-10160.

Furthermore, it should be noted here that query is correctly marked as
None, instead of empty string.

Those are trivial examples and won't be able to *really* prove anything
just yet, but at least they show this approach is worth looking into, that
it could have prevented issues before they become real.

Cheers,
Nam