[Python-Dev] bug in urlparse

Duncan Booth duncan.booth at suttoncourtenay.org.uk
Tue Sep 6 13:51:24 CEST 2005

jepler at unpythonic.net wrote in news:20050904233804.GA2731 at unpythonic.net:

> According to RFC 2396[1] section 5.2:
>       g) If the resulting buffer string still begins with one or more
>          complete path segments of "..", then the reference is
>          considered to be in error.  Implementations may handle this
>          error by retaining these components in the resolved path (i.e.,
>          treating them as part of the final URI), by removing them from
>          the resolved path (i.e., discarding relative levels above the
>          root), or by avoiding traversal of the reference.
> If I read this right, it explicitly allows the urlparse.urljoin behavior
> ("handle this error by retaining these components in the resolved path").

Yes, the urljoin behaviour is explicitly allowed; however, it is not the
most commonly implemented of the permitted behaviours. Both IE and
Mozilla/Firefox handle this error by stripping the spurious .. elements
from the front of the path. Apache, and I hope other web servers, use the
third permitted method, i.e. rejecting requests for these invalid URLs.
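
To make the three permitted behaviours concrete, here is a minimal sketch
in Python. The function name resolve_path and its on_excess parameter are
invented for this illustration; it is not how urlparse.urljoin is
implemented, and it deals only with the path part of a URL (no scheme,
host, parameters or query).

```python
def resolve_path(base_path, ref_path, on_excess="retain"):
    """Resolve a relative path against an absolute base path.

    Hypothetical helper: on_excess selects one of the three ways
    RFC 2396 section 5.2 step 6(g) allows for handling ".." segments
    that would climb above the root:
      "retain" - keep them in the resolved path
      "strip"  - discard them
      "reject" - refuse to resolve the reference
    """
    if on_excess not in ("retain", "strip", "reject"):
        raise ValueError("unknown on_excess value: %r" % (on_excess,))

    # Drop the last segment of the base path and append the reference.
    merged = base_path.rsplit("/", 1)[0] + "/" + ref_path
    segments = merged.split("/")[1:]  # skip the empty string before the first "/"

    out = []
    for i, seg in enumerate(segments):
        is_last = (i == len(segments) - 1)
        if seg == ".":
            pass  # "." refers to the current level and contributes nothing
        elif seg == "..":
            if out and out[-1] != "..":
                out.pop()  # cancel the preceding ordinary segment
            elif on_excess == "retain":
                out.append("..")
            elif on_excess == "reject":
                raise ValueError("reference climbs above the root: %r" % (ref_path,))
            # "strip": silently discard the excess ".."
        else:
            out.append(seg)
            continue
        if is_last:
            out.append("")  # a trailing "." or ".." leaves a trailing slash
    return "/" + "/".join(out)


if __name__ == "__main__":
    for mode in ("retain", "strip"):
        print(mode, resolve_path("/b/c/d", "../../../g", mode))
```

With the base path /b/c/d and the reference ../../../g, "retain" yields
/../g and "strip" yields /g (the result the browsers arrive at), while
"reject" raises ValueError.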

The net effect is that on some sites a Python spider (e.g. webchecker.py)
will produce a large number of error messages for links which browsers
actually resolve successfully. (At least, that is how I first noticed this
particular problem.) Depending on your reasons for spidering a site, this
can be either a good thing or an annoyance.
