[Python-Dev] Some more comments re new uriparse module, patch 1462525

Sun Jun 4 03:09:38 CEST 2006

On Friday, Jun 2, 2006, John J Lee writes:
>[Not sure whether this kind of thing is best posted as tracker comments 
>(but then the tracker gets terribly long and is mailed out every time a 
>change happens) or posted here.  Feel free to tell me I'm posting in the 
>wrong place...]

I think this is a fine place - more googleable, still archived, etc.

>Some comments on this patch (a new module, submitted by Paul Jimenez, 
>implementing the rules set out in RFC 3986 for URI parsing, joining URI 
>references with a base URI etc.)
>
>http://python.org/sf/1462525

Note that like many opensource authors, I wrote this to 'scratch an
itch' that I had... and am submitting it in hopes of saving someone else
somewhere some essentially identical work. I'm not married to it; I just
want something *like* it to end up in the stdlib so that I can use it.

>Sorry for the pause, Paul.  I finally read RFC 3986 -- which I must say is 
>probably the best-written RFC I've read (and there was much rejoicing).

No worries.  Yeah, the RFC is pretty clear (for once) :)

>I still haven't read 3987 and got to grips with the unicode issues 
>(whatever they are), but I have just implemented the same stuff you did, 
>so have some comments on non-unicode aspects of your implementation (the 
>version labelled "v23" on the tracker):
>
>
>Your urljoin implementation seems to pass the tests (the tests taken from 
>the RFC), but I have to I admit I don't understand it :-)  It doesn't seem 
>to take account of the distinction between undefined and empty URI 
>components.  For example, the authority of the URI reference may be empty 
>but still defined.  Anyway, if you're taking advantage of some subtle 
>identity that implies that you can get away with truth-testing in place of 
>"is None" tests, please don't ;-) It's slower than "is [not] None" tests 
>both for the computer and (especially!) the reader.

First of all, I must say that urljoin is my least favorite part of this
module; I include it only so as not to break backward compatibility -
I don't have any personal use-cases for such. That said, some of the
'join' semantics are indeed a bit subtle; it took a bit of tinkering to
make all the tests work. I was indeed using 'if foo:' instead of 'if
foo is not None:', but that can be easily fixed; I didn't know there
was a performance issue there. Stylistically I find them about the same
clarity-wise.

>I don't like the use of module posixpath to implement the algorithm 
>labelled "remove_dot_segments".  URIs are not POSIX filesystem paths, and 
>shouldn't depend on code meant to implement the latter.  But my own 
>implementation is exceedingly ugly ATM, so I'm in no position to grumble 
>too much :-)

While URIs themselves are not, of course, POSIX filesystem paths, I believe
there's a strong case that their path components are semantically identical
in this usage.  I see no need to duplicate code that I know can be fairly
tricky to get right; better to let someone else worry about the corner cases
and take advantage of their work when I can.

>Normalisation of the base URI is optional, and your urljoin function
>never normalises.  Instead, it parses the base and reference, then
>follows the algorithm of section 5.2 of the RFC.  Parsing is required
>before normalisation takes place.  So urljoin forces people who need
>to normalise the URI before to parse it twice, which is annoying.
>There should be some way to parse 5-tuples in instead of URIs.  E.g.,
>from my implementation:
>
>def urljoin(base_uri, uri_reference):
>     return urlunsplit(urljoin_parts(urlsplit(base_uri),
>                                     urlsplit(uri_reference)))
>

It would certainly be easy to add a version which took tuples instead
of strings, but I was attempting, as previously stated, to conform to the
extant urlparse.urljoin API for backward compatability.  Also as I previously
stated, I have no personal use-cases for urljoin so the issue of having to
double-parse if you do normalization never came to my attention.

>It would be nice to have a 5-tuple-like class (I guess implemented as a 
>subclass of tuple) that also exposes attributes (.authority, .path, etc.) 
>-- the same way module time does it.

That starts to edge over into a 'generic URI' class, which I'm uncomfortable
with due to the possibility of opaque URIs that don't conform to that spec.
The fallback of putting everthing other than the scheme into 'path' doesn't
appeal to me.

>The path component is required, though may be empty.  Your parser
>returns None (meaning "undefined") where it should return an empty
>string.

Indeed.  Fixed now; a fresh look at the code showed me where the mistakes
that made that seem necessary lay.

>Nit: Your tests involving ports contain non-digit characters in the
>port (viz. "port"), which is not valid by section 3.2.3 of the RFC.

Indeed.  Nit fixed.

>Smaller nit: the userinfo component was never allowed in http URLs,
>but you use them in your tests.  This issue is outside of RFC 3986, of
>course.

Actually it was allowed in http URLs as an alternate method of
specifying HTTP AUTH, but is indeed now deprecated; I suspect, due to
the prevalence of semantic attacks.

>Particularly because the userinfo component is deprecated, I'd rather
>that userinfo-splitting and joining were separate functions, with the
>other functions dealing only with the standard RFC 3986 5-tuples.

It's only deprecated for http; it's quite the convenience for other
protocols.

>DefaultSchemes should be a class attribute of URIParser

Got any good reasoning behind that?  I don't have any real
reason to keep it a module variable other than 
it vaguely feels too 'heavy' for a class attribute to me.
Maybe we can get a second opinion?

>The naming of URLParser / URIParser is still insane :-)  I suggest
>naming them _URIParser and URIParser respectively.

I dunno aobut insane; as I say in the code:

  URI generally refers to generic URIs and URL refers to
  to URIs that match scheme://user:password@host:port/path?query#fragment.

I suppose another way to say it is that I consider a URI to be more
opaque than a URL - my interpretation of section 1.1.3 of rfc3986 which
says:

   A URI can be further classified as a locator, a name, or both.  The
   term "Uniform Resource Locator" (URL) refers to the subset of URIs
   that, in addition to identifying a resource, provide a means of
   locating the resource by describing its primary access mechanism
   (e.g., its network "location"). 

>I guess there should be no mention of "URL" anywhere in the module --
>only "URI" (even though I hate "URI", as a mostly-worthless
>distinction from "URL", consistency inside the module is more
>important, and URI is technically correct and fits with all the
>terminology used in the RFC).  I'm still heavily -1 on calling it
>"uriparse" though, because of the highly misleading comparison with
>the name "urlparse" (the difference between the modules isn't the
>difference between URIs and URLs).

uriparse is more generic and extensible than urlparse is; just as uris
are to urls.  Honestly I don't care, I was just trying to come up with a
name that would be distinct enough to not cause confusion, wasn't too
inelegant (eg. urlparse2), and was appropriate enough that a future use
would look at it.  I'm open to suggestions.

>Re your comment on "mailto:" in the tracker: sure, I understand it's not 
>meant to be public, but the interface is!  .parse() will return a 4-tuple 
>for mailto: URLs.  For everything else, it will return a 7-tuple.  That's 
>silly.

That's URIs: the length of the tuple returned is scheme-dependent.  The
alternative is to provide less functionality by lumping everything after
the : into a single 'path' component that the programmer will then have to
parse again anyway.  This way they don't have to do that extra work.

>The documentation should explain that the function of URIParser is
>hiding scheme-dependent URI normalisation.

Its function is twofold: to allow hiding and to provide an extensible
framework with a uniform interface.  But yes, the documentation should
be better.

>Method names and locals are still likeThis, contrary to PEP 8.

Fixed most places, I think.

>docstrings and other whitespace are still non-standard -- follow PEP 8
>(and PEP 257, which PEP 8 references) Doesn't have to be totally rigid
>of course -- e.g. lining up the ":" characters in the tests is fine.

I've given the whole thing a once-over to try and come up to spec, but 
I'm sure I missed something, so feel free to point out anything I missed.

>Standard stdlib form documentation is still missing.  I'll be told off
>if I don't read you your rights: you don't have to submit in LaTeX
>markup -- apparently there are hordes of eager LaTeX markers-up
>lurking ready to pounce on poorly-formatted documentation <wink>
>
>Test suite still needs tweaking to put it in standard stdlib form

Excuse my ignorance, but is there a PEP somewhere for 'standard stdlib
form' ?  What do I need to do to bring it up to snuff?

  --pj