[Python-Dev] Path object design

Sat Nov 4 01:56:39 CET 2006

Martin:
> Just in case this isn't clear from Steve's and Fredrik's
> post: The behaviour of this function is (or should be)
> specified, by an IETF RFC. If somebody finds that non-intuitive,
> that's likely because their mental model of relative URIs
> deviate's from the RFC's model.

While I didn't realize that urljoin is only supposed to be used
with a base URL, where "base URL" (used in the docstring) has
a specific requirement that it be absolute.

I instead saw the word "join" and figured it's should do roughly
the same things as os.path.join.

>>> import urlparse
>>> urlparse.urljoin("file:///path/to/hello", "slash/world")
'file:///path/to/slash/world'
>>> urlparse.urljoin("file:///path/to/hello", "/slash/world")
'file:///slash/world'
>>> import os
>>> os.path.join("/path/to/hello", "slash/world")
'/path/to/hello/slash/world'
>>>

It does not.  My intuition, nowadays highly influenced by URLs, is that
with a couple of hypothetical functions for going between filenames and URLs:

os.path.join(absolute_filename, filename)
   ==
file_url_to_filename(urlparse.urljoin(
     filename_to_file_url(absolute_filename),
     filename_to_file_url(filename)))

which is not the case.  os.join assumes the base is a directory
name when used in a join: "inserting '/' as needed" while RFC
1808 says

           The last segment of the base URL's path (anything
           following the rightmost slash "/", or the entire path if no
           slash is present) is removed

Is my intuition wrong in thinking those should be the same?

I suspect it is. I've been very glad that when I ask for a directory
name that I don't need to check that it ends with a "/".  Urljoin's
behaviour is correct for what it's doing.  os.path.join is better for
what it's doing.  (And about once a year I manually verify the
difference because I get unsure.)

I think these should not share the "join" in the name.

If urljoin is not meant for relative base URLs, should it
raise an exception when misused? Hmm, though the RFC
algorithm does not have a failure mode and the result may
be a relative URL.

Consider

>>> urlparse.urljoin("http://blah.com/a/b/c", "..")
'http://blah.com/a/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../")
'http://blah.com/a/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../..")  # What?!
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
'http://blah.com/../../'
>>>

> Of course, there is also the chance that the implementation
> deviates from the RFC; that would be a bug.

The comment in urlparse

    # XXX The stuff below is bogus in various ways...

is ever so reassuring.  I suspect there's a bug given the
previous code.  Or I've a bad mental model.  ;)

                        Andrew
                        dalke at dalkescientific.com