[Python-Dev] Possible bug in urllib.urljoin

Andrew Edmondson a.edmondson at eris.qinetiq.com
Fri Sep 23 09:35:06 CEST 2005


Dear all,

We've found a problem with urljoin (urlparse.urljoin) when
upgrading from Python 2.3 to 2.4. It no longer joins one
particular corner case of URLs correctly (we think!).

The code appears to follow the algorithm (from
http://www.ietf.org/rfc/rfc1808.txt) for resolving URLs
almost exactly...

I believe the problem occurs on reaching "step 5" (approx
line 160), which happens when the embedded URL has no
scheme, netloc or path (and is nonempty).

Following the algorithm, the resulting URL should now be
returned using the base URL's scheme, netloc and path, but the
embedded URL's params / query (where present, otherwise the
base URL's). This is what 2.3 does:

    if not path:
        if not params:
            params = bparams
            if not query:
                query = bquery
        return urlunparse((scheme, netloc, bpath,
                           params, query, fragment))

However, in 2.4, even if the embedded URL's path is empty,
flow passes to step 6 unless the params and query segments
are empty too:

    if not (path or params or query):
        return urlunparse((scheme, netloc, bpath,
                           bparams, bquery, fragment))

Thus the last segment of the base path is removed in order
to append the embedded URL's path, but that path is empty!
The resulting path is therefore returned incorrectly.
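
As a rough illustration (a sketch of the two step-5 branches quoted
above, not the actual library source), the difference can be
reproduced with the two hypothetical helpers below. The names
join_23_style and join_24_style, the simplified step 6, and the
sample URLs are assumptions made for this example; the real step 6
also resolves "." and ".." segments.

```python
try:
    from urlparse import urlparse, urlunparse      # Python 2
except ImportError:
    from urllib.parse import urlparse, urlunparse  # Python 3


def join_23_style(base, url):
    """Step 5 as in the 2.3 excerpt (hypothetical helper).

    Only covers an embedded URL that has no scheme and no netloc.
    """
    bscheme, bnetloc, bpath, bparams, bquery, _ = urlparse(base)
    _, _, path, params, query, fragment = urlparse(url)
    if not path:
        # Empty path: keep the base path, inherit params/query if absent.
        if not params:
            params = bparams
            if not query:
                query = bquery
        return urlunparse((bscheme, bnetloc, bpath,
                           params, query, fragment))
    return None  # non-empty paths are outside this sketch


def join_24_style(base, url):
    """Step 5 as in the 2.4 excerpt, then a simplified step 6."""
    bscheme, bnetloc, bpath, bparams, bquery, _ = urlparse(base)
    _, _, path, params, query, fragment = urlparse(url)
    if not (path or params or query):
        return urlunparse((bscheme, bnetloc, bpath,
                           bparams, bquery, fragment))
    # Simplified step 6: drop the last base segment, append the
    # embedded path (no "." / ".." handling here).
    segments = bpath.split('/')[:-1] + path.split('/')
    return urlunparse((bscheme, bnetloc, '/'.join(segments),
                       params, query, fragment))


base = "http://a/b/c/d;p?q"   # assumed sample base URL
embedded = "?y"               # query only: no scheme, netloc or path

print(join_23_style(base, embedded))
print(join_24_style(base, embedded))
```

With these assumed inputs, the 2.3-style branch keeps the base path
and params and yields http://a/b/c/d;p?y, while the 2.4-style branch
drops the last segment "d" and yields http://a/b/c/?y.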

Can you tell me if this was a deliberate decision to move
away from the algorithm? If so, we'll work around it.
-- 
##############################################################################
Andrew Edmondson
PGP Key: http://search.keyserver.net:11371/pks/lookup?op=get&search=0xCEE814DC
PGP Fingerprint: 7B32 4D1E AC4F 29E2 9EAA 9550 1A3D BBA4 CEE8 14DC


