[Python-Dev] bug in urlparse

Tue Sep 6 13:51:24 CEST 2005

jepler at unpythonic.net wrote in news:20050904233804.GA2731 at unpythonic.net:

> According to RFC 2396[1] section 5.2:
> 
>       g) If the resulting buffer string still begins with one or more
>          complete path segments of "..", then the reference is
>          considered to be in error.  Implementations may handle this
>          error by retaining these components in the resolved path (i.e.,
>          treating them as part of the final URI), by removing them from
>          the resolved path (i.e., discarding relative levels above the
>          root), or by avoiding traversal of the reference.
> 
> If I read this right, it explicitly allows the urlparse.urljoin behavior
> ("handle this error by retaining these components in the resolved path").
> 

Yes, the urljoin behaviour is explicitly allowed; however, it is not the 
most commonly implemented of the permitted behaviours. Both IE and 
Mozilla/Firefox handle this error by stripping the spurious .. elements 
from the front of the path. Apache, and I hope other web servers, use the 
third permitted method, i.e. rejecting requests for these invalid URLs.

The net effect is that on some sites a Python spider (e.g. webchecker.py) 
will produce a large number of error messages for links which browsers 
actually resolve successfully. (At least, that is how I first noticed this 
particular problem.) Depending on your reasons for spidering a site, this 
can be either a good thing or an annoyance.