[Python-Dev] bug in urlparse
Duncan Booth
duncan.booth at suttoncourtenay.org.uk
Tue Sep 6 13:51:24 CEST 2005
jepler at unpythonic.net wrote in news:20050904233804.GA2731 at unpythonic.net:
> According to RFC 2396[1] section 5.2:
>
> g) If the resulting buffer string still begins with one or more
> complete path segments of "..", then the reference is
> considered to be in error. Implementations may handle this
> error by retaining these components in the resolved path (i.e.,
> treating them as part of the final URI), by removing them from
> the resolved path (i.e., discarding relative levels above the
> root), or by avoiding traversal of the reference.
>
> If I read this right, it explicitly allows the urlparse.urljoin behavior
> ("handle this error by retaining these components in the resolved path").
>
Yes, the urljoin behaviour is explicitly allowed; however, it is not the
most commonly implemented of the permitted behaviours. Both IE and
Mozilla/Firefox handle this error by stripping the spurious .. elements from
the front of the path. Apache, and I hope other web servers, use the third
permitted method, i.e. rejecting requests for these invalid URLs.
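
To make the three permitted behaviours concrete, here is a rough sketch of
the path-merging step with the handling of leftover ".." segments made
selectable. It is not the urlparse implementation: resolve_path and its
policy argument are hypothetical names invented for this illustration, and it
only handles an absolute base path (no scheme, host, query or fragment).

```python
def resolve_path(base_path, ref, policy="retain"):
    """Resolve a relative path `ref` against an absolute `base_path`.

    Hypothetical helper, for illustration only. `policy` selects how
    ".." segments that would climb above the root are handled:
      "retain" - keep them in the resolved path
      "strip"  - silently discard them
      "reject" - raise ValueError instead of resolving
    """
    if policy not in ("retain", "strip", "reject"):
        raise ValueError("unknown policy: %r" % (policy,))

    # Drop everything after the last "/" of the base, then append the reference.
    merged = base_path[: base_path.rfind("/") + 1] + ref
    parts = merged.split("/")[1:]  # skip the empty string before the leading "/"

    resolved = []
    for index, segment in enumerate(parts):
        is_last = index == len(parts) - 1
        if segment == ".":
            if is_last:
                resolved.append("")  # a trailing "." leaves a trailing slash
            continue
        if segment == "..":
            if resolved and resolved[-1] != "..":
                resolved.pop()  # normal case: cancel the preceding segment
            elif policy == "retain":
                resolved.append("..")
            elif policy == "reject":
                raise ValueError("reference climbs above the root: %r" % (ref,))
            # policy == "strip": drop the segment and carry on
            if is_last:
                resolved.append("")  # a trailing ".." leaves a trailing slash
            continue
        resolved.append(segment)
    return "/" + "/".join(resolved)
```

With a base path of "/b/c/d" and the reference "../../../g", the "retain"
policy gives "/../g", "strip" gives "/g", and "reject" raises ValueError.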
The net effect is that on some sites, using a Python spider (e.g.
webchecker.py) will produce a large number of error messages for links
which browsers actually resolve successfully. (At least, that is how I
first noticed this particular problem.) Depending on your reasons for
spidering a site, this can be either a good thing or an annoyance.