[Python-Dev] Path object design

Sun Nov 5 23:29:13 CET 2006

Martin:
> Unfortunately, you didn't say which of these you want explained.
> As it is tedious to write down even a single one, I restrain to the
> one with the What?! remark.
>
> >>>> urlparse.urljoin("http://blah.com/a/b/c", "../../../..")  # What?!
> > 'http://blah.com/'

The "What?!" is in context with the previous and next entries.  I've
reduced it to a simpler case

>>> urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'

Does the result make sense to you?  Does it make
sense that the last of these is shorter than the middle
one?  It sure doesn't to me.  I thought it was obvious
that there was an error; obvious enough that I didn't
bother to track down why - especially as my main point
was to argue there are different ways to deal with
hierarchical/path-like schemes, each correct for its
given domain.

> Please follow me through section 5 of
>
> http://www.ietf.org/rfc/rfc3986.txt

The core algorithm causing the "what?!" comes from
"reduce_dot_segments", section 5.2.4.  In parallel my
3 cases should give:

5.2.4 Remove Dot Segments
 remove_dot_segments("/..")    r_d_s("/../")    r_d_s("/../..")

 1. I = "/.."               I="/../"            I="/../.."
    O = ""                  O=""                O=""
 2A. (does not apply)     2A. (does not apply)  2A. (does not apply)
 2B. (does not apply)     2B. (does not apply)  2B. (does not apply)
 2C. O="" I="/"           2C. O="" I="/"        2C. O="" I="/.."
 2A. (does not apply)     2A. (does not apply)   .. reduces to r_d_s("/..")
 2B. (does not apply)     2B. (does not apply)  3. Result "/"
 2C. (does not apply)     2C. (does not apply)
 2D. (does not apply)     2D. (does not apply)
 2E. O="/", I=""          2E. O="/", I=""
 3. Result: "/"           3. Result "/"

My reading of the RFC 3986 says all three examples should
produce the same result.  The fact that my "what?!" comment happens
to be correct according to that RFC is purely coincidental.

Then again, urlparse.py does *not* claim to be RFC 3986 compliant.
The module docstring is

"""Parse (absolute and relative) URLs.

See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
UC Irvine, June 1995.
"""

I tried the same code with 4Suite, which does claim compliance, and get

>>> import Ft
>>> from Ft.Lib import Uri
>>> Uri.Absolutize("..", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../..", "http://blah.com/")
'http://blah.com/'
>>>

The text of it's Uri.py says

    This function is similar to urlparse.urljoin() and urllib.basejoin().
    Those functions, however, are (as of Python 2.3) outdated, buggy, and/or
    designed to produce results acceptable for use with other core Python
    libraries, rather than being earnest implementations of the relevant
    specs. Their problems are most noticeable in their handling of
    same-document references and 'file:' URIs, both being situations that
    come up far too often to consider the functions reliable enough for
    general use.
    """
    # Reasons to avoid using urllib.basejoin() and urlparse.urljoin():
    # - Both are partial implementations of long-obsolete specs.
    # - Both accept relative URLs as the base, which no spec allows.
    # - urllib.basejoin() mishandles the '' and '..' references.
    # - If the base URL uses a non-hierarchical or relative path,
    #    or if the URL scheme is unrecognized, the result is not
    #    always as expected (partly due to issues in RFC 1808).
    # - If the authority component of a 'file' URI is empty,
    #    the authority component is removed altogether. If it was
    #    not present, an empty authority component is in the result.
    # - '.' and '..' segments are not always collapsed as well as they
    #    should be (partly due to issues in RFC 1808).
    # - Effective Python 2.4, urllib.basejoin() *is* urlparse.urljoin(),
    #    but urlparse.urljoin() is still based on RFC 1808.

In searching the archives
  http://mail.python.org/pipermail/python-dev/2005-September/056152.html

Fabien Schwob:
> I'm using the module urlparse and I think I've found a bug in the
> urlparse module. When you merge an url and a link
> like"../../../page.html" with urljoin, the new url created keep some
> "../" in it. Here is an example :
>
>  >>> import urlparse
>  >>> begin = "http://www.example.com/folder/page.html"
>  >>> end = "../../../otherpage.html"
>  >>> urlparse.urljoin(begin, end)
> 'http://www.example.com/../../otherpage.html'

Guido:
> You shouldn't be giving more "../" sequences than are possible. I find
> the current behavior acceptable.

(Aparently for RFC 1808 that's a valid answer; it was an implementation
choice in how to handle that case.)

While not directly relevant, postings like John J Lee's
 http://mail.python.org/pipermail/python-bugs-list/2006-February/031875.html
> The urlparse.urlparse() code should not be changed, for
> backwards compatibility reasons.

strongly suggest a desire to not change that code.  The last
definitive statement on this topic that I could find was mentioned in
http://www.python.org/dev/summary/2005-11-16_2005-11-30/#updating-urlparse-to-support-rfc-3986
> Guido pointed out that the main purpose of urlparse is to be RFC-compliant.
> Paul explained that the current code is valid according to RFC 1808
> (1995-1998), but that this was superceded by RFC 2396 (1998-2004)
> and RFC 3986 (2005-). Guido was convinced, and asked for a new API
> (for backwards compatibility) and a patch to be submitted via sourceforge.

As this is not a bug, I have added the feature request 1591035 to SF
titled "update urlparse to RFC 3986".  Nothing else appeared to exist
on that specific topic.

                                Andrew
                                dalke at dalkescientific.com