[Python-Dev] bug in urlparse
duncan.booth at suttoncourtenay.org.uk
Tue Sep 6 13:51:24 CEST 2005
jepler at unpythonic.net wrote in news:20050904233804.GA2731 at unpythonic.net:
> According to RFC 2396 section 5.2:
> g) If the resulting buffer string still begins with one or more
> complete path segments of "..", then the reference is
> considered to be in error. Implementations may handle this
> error by retaining these components in the resolved path (i.e.,
> treating them as part of the final URI), by removing them from
> the resolved path (i.e., discarding relative levels above the
> root), or by avoiding traversal of the reference.
> If I read this right, it explicitly allows the urlparse.urljoin behavior
> ("handle this error by retaining these components in the resolved path").
Yes, the urljoin behaviour is explicitly allowed; however, it is not the
most commonly implemented of the permitted behaviours. Both IE and
Mozilla/Firefox handle this error by stripping the spurious .. elements
from the front of the path. Apache, and I hope other web servers, use the
third permitted method, i.e. rejecting requests for these invalid URLs.
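
To make the three permitted behaviours concrete, here is a rough sketch of
path resolution with a selectable policy. It is not the urlparse code:
resolve_path and its on_excess parameter are hypothetical names invented for
illustration, and it assumes an absolute base path, a relative reference with
no query or fragment, and a reference that does not end in "." or "..".

```python
def resolve_path(base_path, ref_path, on_excess="retain"):
    """Resolve a relative path against an absolute base path.

    on_excess selects how '..' segments that climb above the root are handled:
    'retain' keeps them, 'strip' discards them, 'reject' raises ValueError.
    (Hypothetical helper for illustration; not part of urlparse.)
    """
    if on_excess not in ("retain", "strip", "reject"):
        raise ValueError("unknown policy: %r" % on_excess)

    # Drop the last segment of the base (everything after the final '/'),
    # then append the segments of the reference.
    segments = base_path.split("/")[1:-1] + ref_path.split("/")

    resolved = []   # segments of the path built so far
    excess = 0      # count of '..' segments with nothing left to remove
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if resolved:
                resolved.pop()
            else:
                excess += 1
            continue
        resolved.append(seg)

    if excess and on_excess == "reject":
        raise ValueError("reference climbs above the root: %r" % ref_path)
    if on_excess == "retain":
        resolved = [".."] * excess + resolved
    return "/" + "/".join(resolved)
```

With base path /b/c/d and reference ../../../g, the 'retain' policy gives
/../g (the result the RFC text describes as retaining the components),
'strip' gives /g (what the browsers do), and 'reject' raises ValueError.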
The net effect is that on some sites a Python spider (e.g.
webchecker.py) will produce a large number of error messages for links
which browsers actually resolve successfully. (At least, that is how I
first noticed this particular problem.) Depending on your reasons for
spidering a site, this can be either a good thing or an annoyance.