Fwd: Re: Universal parsing library in the stdlib to alleviate security issues

Forward to the list because Abusix had blocked google.com initially.

Nam

---------- Forwarded message ---------
From: Nam Nguyen <bitsink@gmail.com>
Date: Sun, Jul 28, 2019 at 10:18 AM
Subject: Re: [Python-ideas] Re: Universal parsing library in the stdlib to alleviate security issues
To: Sebastian Kreft <skreft@gmail.com>
Cc: Paul Moore <p.f.moore@gmail.com>, python-ideas <python-ideas@python.org>

Let's circle back to the beginning one last time ;).

On Thu, Jul 25, 2019 at 8:15 AM Sebastian Kreft <skreft@gmail.com> wrote:
Since my final exam was done this weekend, I gathered some more info into this spreadsheet. https://docs.google.com/spreadsheets/d/1TlWSf8iM7eIzEPXanJAP8Ztyzt4ZD28xFvUK... I think a strict parser can help with the majority of those problems. They are in HTTP headers, emails, cookies, URLs, and even low level socket code (inet_atoi).
Most grammars I have seen here come straight from RFCs, which are in ABNF and thus context-free. Current implementations are based on regexes or string splitting. My previous example showed that at least 30500, 36216, and 36742 would have been non-issues if we had started out with a strict parser.
Right now, it is not clear what the impact of such refactor would be, nor the worth of such attempt.
Exactly the kind of response I'm looking for. It is okay to suggest that the benefits aren't clear or that there are requirements X and Y that a general parser won't be able to meet, but it's not convincing to brush aside this because there is "existing, working code." Many of the bugs in that sheet are still open. It's not comfortable to say the code is working with a straight face as I have experienced with my own fix for 30500. I just couldn't tell if it was doing the right thing.
Yes, that's the most important point because "readability counts." It's hard to reason about correctness when there are many transformations between the authoritative spec and the implementation. I definitely don't want to touch regexes, string splits, and custom logic when I don't understand why they were written that way in the first place. How do I, for example, know what this regex is about?

    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

(It's from RFC 3986.)
But you also need to take into consideration some of the list's concerns: the parser library has to be performant, as a performance regression is unlikely to be tolerated.
Absolutely. That's where I need input from the list. I have provided my own set of requirements for such a parser library. I'm sure most of us have different needs too. So if a parser library can help you, let's hear what you want from it. If you think it can't, please help me understand why.

Thanks,
Nam

Nam Nguyen writes:
This is useful!
Most grammars I have seen here come straight from RFCs,
Grammars are only truly relevant if you have a parser-compiler for them. Otherwise, you're still translating by hand. See following discussion of urlsplit, which has issues #30500 and #36216 etc.
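To make "translating by hand" concrete, here is a minimal sketch of what it looks like for a single rule. The ABNF line in the comment is the `scheme` rule from RFC 3986, section 3.1; the helper name `parse_scheme` and the constants are hypothetical, invented for this illustration, and are not part of urllib or any proposed library.

```python
import string

# RFC 3986, section 3.1:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
ALPHA = frozenset(string.ascii_letters)
DIGIT = frozenset(string.digits)
SCHEME_TAIL = ALPHA | DIGIT | frozenset("+-.")


def parse_scheme(text):
    """Hypothetical helper: hand translation of the ABNF rule above.

    Returns (scheme, rest) if text starts with a scheme, else None.
    """
    # The first character must be ALPHA.
    if not text or text[0] not in ALPHA:
        return None
    # Then consume zero or more characters from the tail set.
    end = 1
    while end < len(text) and text[end] in SCHEME_TAIL:
        end += 1
    return text[:end], text[end:]


print(parse_scheme("http://example.com"))
print(parse_scheme("1http://example.com"))
```

Every rule in the grammar needs a translation like this one, and nothing checks that the code and the ABNF stay in agreement. A parser-compiler would take the ABNF line itself as its input.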
For #30500, I don't recall you dealing with the claim that the parse is not for RFC 3986 URIs, but for a sort of "raw URI" which might not be percent-encoded. (AFAICT, RFC 3986 taken strictly restricts proper URIs to a subset of ASCII.) In that case, the translation would not be straightforward. This understanding is supported by #36216, which refers to IDNA. It's possible that urlsplit is intended to deal with RFC 3987 IRIs, but I tend to the view that it's just unclear. ;-)
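A small sketch of the "raw URI" behaviour, using only `urllib.parse.urlsplit` from the standard library. The example URL and the helper `is_ascii_only` are made up for illustration; the helper is only a necessary condition for a strict RFC 3986 URI, not a full validity check.

```python
from urllib.parse import urlsplit


def is_ascii_only(uri):
    """Hypothetical helper: necessary (not sufficient) for a strict RFC 3986 URI."""
    return all(ord(ch) < 128 for ch in uri)


# A "raw" URL whose path contains a character that is not percent-encoded.
raw = "http://example.com/caf\u00e9?q=1"
parts = urlsplit(raw)

# urlsplit splits the string without rejecting or encoding the non-ASCII path.
print(parts.path)
print(is_ascii_only(raw))
```

So whatever urlsplit is parsing, its input is not limited to the ASCII subset that the RFC 3986 grammar describes, which is why a direct translation of that grammar would change behaviour.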
If it's just a straight implementation of RFC 3986, that shouldn't be too hard.
By rewriting it as

    uri = r"""^(?:
               ([^:/?#]+):    # group 1, optional scheme
             )?(?:
               //([^/?#]*)    # group 2, optional authority
             )?
             ([^?#]*)         # group 3, optional path
             (?:
               \?([^#]*)      # group 4, optional query
             )?(?:
               \#(.*)         # group 5, optional fragment
             )?$"""

(where I've revised it to use non-capturing groups and explicitly continue to end-of-string) and compiling with re.VERBOSE, or some similar device. An experienced Pythonista will likely do a double-take at group 3, then realize that "path" is a component that isn't delimited by a reserved character. Note that this regexp assumes a string containing no characters outside of the ASCII subset defined in the RFC! I'm not sure which is easier to read, this or the rather long grammar of RFC 3986 (or the even longer grammar of RFC 3987).
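As a sanity check on the rewrite, the following sketch compiles the commented pattern with re.VERBOSE and compares it with the original one-line regex from RFC 3986, Appendix B, on the example URI given there. The names `APPENDIX_B`, `URI`, and `split_uri` are hypothetical. One detail matters under re.VERBOSE: a `#` outside a character class starts a comment, so the `#` that introduces the fragment is written as `\#` here.

```python
import re

# The one-line regex from RFC 3986, Appendix B.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# The commented rewrite, with non-capturing groups and an end-of-string anchor.
URI = re.compile(r"""
    ^(?:
      ([^:/?#]+):      # group 1, optional scheme
    )?(?:
      //([^/?#]*)      # group 2, optional authority
    )?
    ([^?#]*)           # group 3, path (may be empty)
    (?:
      \?([^#]*)        # group 4, optional query
    )?(?:
      \#(.*)           # group 5, optional fragment (escaped '#')
    )?$
""", re.VERBOSE)


def split_uri(uri):
    """Hypothetical helper: (scheme, authority, path, query, fragment) or None."""
    match = URI.match(uri)
    return match.groups() if match else None


example = "http://www.ics.uci.edu/pub/ietf/uri/#Related"
old = APPENDIX_B.match(example)
print(split_uri(example))
# Groups 2, 4, 5, 7, 9 of the Appendix B regex are the five components.
print(old.group(2, 4, 5, 7, 9))
```

Both patterns yield the same five components for this input, with the missing query reported as None. This is still a splitter rather than a validator: it does not check that each component conforms to the RFC grammar.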
I believe you've already heard from the people on this list who care. Its members are active participants, mostly. Its lurkers are frequently core devs who figure if it gets traction they will give their input on -dev. I think a better place to take a poll like this would be python-list@python.org. Steve

participants (2): Nam Nguyen, Stephen J. Turnbull