Martin:
Unfortunately, you didn't say which of these you want explained. As it is tedious to write down even a single one, I restrain to the one with the What?! remark.
urlparse.urljoin("http://blah.com/a/b/c", "../../../..") # What?! 'http://blah.com/'
The "What?!" is in context with the previous and next entries. I've reduced it to a simpler case
>>> urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
>>> urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
>>> urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'
Does the result make sense to you? Does it make sense that the last of these is shorter than the middle one? It sure doesn't to me. I thought it was obvious that there was an error; obvious enough that I didn't bother to track down why - especially as my main point was to argue there are different ways to deal with hierarchical/path-like schemes, each correct for its given domain.
Please follow me through section 5 of
The core algorithm causing the "what?!" comes from "remove_dot_segments", section 5.2.4. In parallel my 3 cases should give:

5.2.4 Remove Dot Segments

remove_dot_segments("/..")   r_d_s("/../")          r_d_s("/../..")

1.  I = "/.."                I="/../"               I="/../.."
    O = ""                   O=""                   O=""
2A. (does not apply)         2A. (does not apply)   2A. (does not apply)
2B. (does not apply)         2B. (does not apply)   2B. (does not apply)
2C. O="" I="/"               2C. O="" I="/"         2C. O="" I="/.."
2A. (does not apply)         2A. (does not apply)   .. reduces to r_d_s("/..")
2B. (does not apply)         2B. (does not apply)   3. Result "/"
2C. (does not apply)         2C. (does not apply)
2D. (does not apply)         2D. (does not apply)
2E. O="/", I=""              2E. O="/", I=""
3. Result: "/"               3. Result "/"

My reading of RFC 3986 says all three examples should produce the same result. The fact that my "what?!" comment happens to be correct according to that RFC is purely coincidental.

Then again, urlparse.py does *not* claim to be RFC 3986 compliant. The module docstring is

    """Parse (absolute and relative) URLs.

    See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
    UC Irvine, June 1995.
    """

I tried the same code with 4Suite, which does claim compliance, and get
>>> import Ft
>>> from Ft.Lib import Uri
>>> Uri.Absolutize("..", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../", "http://blah.com/")
'http://blah.com/'
>>> Uri.Absolutize("../..", "http://blah.com/")
'http://blah.com/'
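For anyone who wants to check the section 5.2.4 trace above mechanically, here is a minimal sketch of those steps in Python. It is my own reading of the RFC text, not code taken from urlparse.py or 4Suite; the names "inp" and "out" are just illustrative stand-ins for the RFC's input and output buffers.

```python
def remove_dot_segments(path):
    """Sketch of RFC 3986 section 5.2.4 (steps 1 to 3).

    Illustrative only: 'inp' and 'out' are hypothetical names for the
    RFC's input and output buffers.
    """
    inp = path   # step 1: input buffer starts as the path
    out = []     # step 1: output buffer, kept as a list of segments
    while inp:   # step 2: loop until the input buffer is empty
        if inp.startswith("../"):        # 2A: drop a leading "../"
            inp = inp[3:]
        elif inp.startswith("./"):       # 2A: drop a leading "./"
            inp = inp[2:]
        elif inp.startswith("/./"):      # 2B: replace "/./" with "/"
            inp = inp[2:]
        elif inp == "/.":                # 2B: replace a final "/." with "/"
            inp = "/"
        elif inp.startswith("/../"):     # 2C: replace "/../" with "/" ...
            inp = inp[3:]
            if out:                      # ... and drop the last output segment
                out.pop()
        elif inp == "/..":               # 2C: same for a final "/.."
            inp = "/"
            if out:
                out.pop()
        elif inp in (".", ".."):         # 2D: a lone "." or ".." disappears
            inp = ""
        else:                            # 2E: move the first segment to output
            cut = inp.find("/", 1)       # next "/" after any leading one
            if cut == -1:
                cut = len(inp)
            out.append(inp[:cut])
            inp = inp[cut:]
    return "".join(out)                  # step 3: the output buffer is the result
```

Under this sketch "/..", "/../" and "/../.." all come out as "/", matching the trace, and the two examples given in section 5.2.4 itself, "/a/b/c/./../../g" and "mid/content=5/../6", come out as "/a/g" and "mid/6".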
The text of its Uri.py says

    This function is similar to urlparse.urljoin() and urllib.basejoin().
    Those functions, however, are (as of Python 2.3) outdated, buggy, and/or
    designed to produce results acceptable for use with other core Python
    libraries, rather than being earnest implementations of the relevant
    specs. Their problems are most noticeable in their handling of
    same-document references and 'file:' URIs, both being situations that
    come up far too often to consider the functions reliable enough for
    general use.
    """
    # Reasons to avoid using urllib.basejoin() and urlparse.urljoin():
    # - Both are partial implementations of long-obsolete specs.
    # - Both accept relative URLs as the base, which no spec allows.
    # - urllib.basejoin() mishandles the '' and '..' references.
    # - If the base URL uses a non-hierarchical or relative path,
    #   or if the URL scheme is unrecognized, the result is not
    #   always as expected (partly due to issues in RFC 1808).
    # - If the authority component of a 'file' URI is empty,
    #   the authority component is removed altogether. If it was
    #   not present, an empty authority component is in the result.
    # - '.' and '..' segments are not always collapsed as well as they
    #   should be (partly due to issues in RFC 1808).
    # - Effective Python 2.4, urllib.basejoin() *is* urlparse.urljoin(),
    #   but urlparse.urljoin() is still based on RFC 1808.

In searching the archives

  http://mail.python.org/pipermail/python-dev/2005-September/056152.html

Fabien Schwob:
I'm using the module urlparse and I think I've found a bug in the urlparse module. When you merge an url and a link like "../../../page.html" with urljoin, the new url created keep some "../" in it. Here is an example :
>>> import urlparse
>>> begin = "http://www.example.com/folder/page.html"
>>> end = "../../../otherpage.html"
>>> urlparse.urljoin(begin, end)
'http://www.example.com/../../otherpage.html'
Guido:
You shouldn't be giving more "../" sequences than are possible. I find the current behavior acceptable.
(Apparently for RFC 1808 that's a valid answer; it was an implementation choice in how to handle that case.)

While not directly relevant, postings like John J Lee's

  http://mail.python.org/pipermail/python-bugs-list/2006-February/031875.html
The urlparse.urlparse() code should not be changed, for backwards compatibility reasons.
strongly suggest a desire to not change that code. The last definitive statement on this topic that I could find was mentioned in http://www.python.org/dev/summary/2005-11-16_2005-11-30/#updating-urlparse-t...
Guido pointed out that the main purpose of urlparse is to be RFC-compliant. Paul explained that the current code is valid according to RFC 1808 (1995-1998), but that this was superseded by RFC 2396 (1998-2004) and RFC 3986 (2005-). Guido was convinced, and asked for a new API (for backwards compatibility) and a patch to be submitted via sourceforge.
As this is not a bug, I have added the feature request 1591035 to SF titled "update urlparse to RFC 3986". Nothing else appeared to exist on that specific topic.

Andrew
dalke@dalkescientific.com