[Python-ideas] Fwd: Re: Universal parsing library in the stdlib to alleviate security issues

29 Jul 2019

      Nam Nguyen writes:
...
> Since my final exam was done this weekend, I gathered some more info into
> this spreadsheet.
> https://docs.google.com/spreadsheets/d/1TlWSf8iM7eIzEPXanJAP8Ztyzt4ZD28xFvUK...

This is useful!
...
> Most grammars I have seen here come straight from RFCs,

Grammars are only truly relevant if you have a parser-compiler for
them.  Otherwise, you're still translating by hand.  See following
discussion of urlsplit, which has issues #30500 and #36216 etc.
...
> which are in ABNF and thus context-free. Current implementations
> are based on regexes or string splitting. My previous example
> showed that at least 30500, 36216, 36742 were non-issues if we
> started out with a strict parser.

For #30500, I don't recall you dealing with the claim that the parse
is not for RFC 3986 URIs, but for a sort of "raw URI" which might not
be percent-encoded.  (AFAICT, RFC 3986 taken strictly restricts proper
URIs to a subset of ASCII.)  In that case, the translation would not
be straightforward.  This understanding is supported by #36216, which
refers to IDNA.

It's possible that urlsplit is intended to deal with RFC 3987 IRIs,
but I tend to the view that it's just unclear. ;-)
...
> that sheet are still open. It's not comfortable to say the code is
> working with a straight face as I have experienced with my own fix
> for 30500. I just couldn't tell if it was doing the right thing.

If it's just a straight implementation of RFC 3986, that shouldn't be
too hard.
...
> How do I, for example, know what this regex is about
> ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
> from RFC 3986?

By rewriting it as

    uri = r"""^(?:
              ([^:/?#]+):        # group 1, optional scheme
              )?(?:
              //([^/?#]*)        # group 2, optional authority
              )?
              ([^?#]*)           # group 3, optional path
              (?:
              \?([^#]*)          # group 4, optional query
              )?(?:
              \#(.*)             # group 5, optional fragment
              )?$"""

(where I've revised it to use non-capturing groups and explicitly
continue to end-of-string) and compiling with re.VERBOSE, or some
similar device.  An experienced Pythonista will likely do a
double-take at group 3, then realize that "path" is a component that
isn't delimited by a reserved character.  Note that this regexp
assumes a string containing no characters outside of the ASCII subset
defined in the RFC!
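
As a minimal sketch of the "compile with re.VERBOSE" step, something
like the following should work.  The names URI and split_uri are mine,
purely for illustration, and the literal "#" in front of the fragment
is written as "\#" because re.VERBOSE treats an unescaped "#" outside
a character class as the start of a comment.  The last call shows that
the pattern itself does nothing to enforce the ASCII restriction.

```python
import re

# Verbose form of the RFC 3986 Appendix B regex, using non-capturing
# groups for the delimiters and anchored at both ends.
URI = re.compile(
    r"""^(?:
        ([^:/?#]+):        # group 1, optional scheme
        )?(?:
        //([^/?#]*)        # group 2, optional authority
        )?
        ([^?#]*)           # group 3, path (may be empty)
        (?:
        \?([^#]*)          # group 4, optional query
        )?(?:
        \#(.*)             # group 5, optional fragment ("#" escaped)
        )?$""",
    re.VERBOSE,
)


def split_uri(text):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    match = URI.match(text)
    if match is None:
        # Can happen, e.g., when the fragment contains a newline.
        raise ValueError("not splittable: %r" % (text,))
    return match.groups()


print(split_uri("http://www.example.com/a/b?x=1#frag"))
print(split_uri("mailto:user@example.com"))
print(split_uri("http://b\u00fccher.example/"))   # non-ASCII is accepted as-is
```

This only splits a string into components; it makes no attempt to
validate them against the RFC 3986 grammar.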

I'm not sure which is easier to read, this or the rather long grammar
of RFC 3986 (or the even longer grammar of RFC 3987).
...
> Absolutely. That's where I need inputs from the list. I have
> provided my own set of requirements for such a parser library. I'm
> sure most of us have different needs too. So if a parser library
> can help you, let's hear what you want from it.

I believe you've already heard from the people on this list who care.
Its members are active participants, mostly.  Its lurkers are
frequently core devs who figure if it gets traction they will give
their input on -dev.  I think a better place to take a poll like this
would be python-list@python.org.

Steve

Stephen J. Turnbull