[ python-Bugs-639311 ] urllib.basejoin() mishandles ''

Fri Mar 26 09:05:42 EST 2004

Bugs item #639311, was opened at 2002-11-16 04:34
Message generated for change (Comment added) made by bcannon
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=639311&group_id=5470

Category: Python Library
Group: Python 2.2.1
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: Mike Brown (mike_j_brown)
Assigned to: Brett Cannon (bcannon)
Summary: urllib.basejoin() mishandles ''

Initial Comment:
It&#039;s not entirely clear whether urllib.basejoin() intends to 
implement RFC 2396&#039;s "resolution of relative URI 
references to absolute form" faithfully, but it seems to 
behave improperly when given an empty string as the 
relative URI to make absolute.

>>> from urllib import basejoin
>>> basejoin(&#039;http://host/foo/bar.xml&#039;,&#039;&#039;)
&#039;http://host/foo/&#039;

I believe it should return the base as-is, because the 
empty string is a reference to the document that 
contains that reference... and presumably the 
document&#039;s URI is what you&#039;re passing in as the base.

----------------------------------------------------------------------

>Comment By: Brett Cannon (bcannon)
Date: 2004-03-26 06:05

Message:
Logged In: YES 
user_id=357491

I actually just did ``from urlparse import urljoin as basejoin`` and 
deleted the code for basejoin.  So I have already implemented your 
suggestion in Python 2.4 .

Thanks for the help, Mike.

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-03-26 01:59

Message:
Logged In: YES 
user_id=371366

Given that it's not part of the public API and isn't used 
anywhere other than urllib.FancyURLOpener, I think that 
should be safe.

As for whether I recommend it, well, I don't know. Cons are 
that urlparse.urljoin() is currently slower, makes the call stack 
a bit deeper, and has some minor issues related to the fact 
that the entire module was coded to implement facets of RFC 
1808, which has been obsolete since Aug 1998. RFC 1808 has 
various issues mainly relating to how it handles relative paths, 
path params, paths containing '.' or '..' segments, and the 
finer points of merging query and fragment parts. None of 
these are showstoppers, but urljoin doesn't even get 1808 
completely right, though -- it accepts relative base URLs, for 
example, which is a contradiction in terms that has never 
been allowed by the specs. (basejoin does this, too) Another 
possible con is that importing a function from urlparse into 
urllib might not be desirable. I don't know what your policy is 
on introducing dependencies between core libs.

Alternatively, what I think might be a better option is to just 
change urllib.FancyURLOpener.redirect_internal() to use 
urlparse.urljoin() instead of urllib.basejoin(). Then, you could 
just remove urllib.basejoin() altogether.

For the record, the main differences in behavior that you can 
expect to see when using urljoin instead of basejoin will be 
that it will better handle the case where the new URL in the 
redirect is an empty string or just '..'. There will also be some 
weird behavior, in some cases, with URLs that use 'file' or 
unrecognized URL schemes, but basejoin() wouldn't have 
done any better.

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2004-03-25 07:47

Message:
Logged In: YES 
user_id=357491

There is a misunderstanding about what I meant by "documented".  I 
meant it is not a specific part of the public API since it is not documented 
at http://www.python.org/dev/doc/devel/lib/module-urllib.html which is 
the official docs for urllib.  Having a docstring does not mean the code is 
considered part of the public API.  Sorry I didn't clarify that.

Now you obviously know a lot about this Mike, so can you give me your 
opinion on whether replacing urllib.basejoin with urlparse.urljoin is a 
reasonable move?

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-03-23 15:15

Message:
Logged In: YES 
user_id=371366

Well, it is documented as "Utility to combine a URL with a 
base URL to form a new URL".

The notion of 'base URL' is described in the RFCs mentioned in 
urllib.__doc__ (all of which predate RFC 2396, I notice, 
though it doesn't matter) and is exclusively applicable to 
relative URL/URI reference resolution, a process in which an 
empty path denotes a same-document reference / no 
traversal from the origin. So I don't think it was unreasonable 
of me to have an expectation that basejoin() would be at 
least somewhat conformant to the specs that its parent 
module purports to implement.

And it actually *is* conformant when the relative URL 
consists of just a query or fragment; the empty path doesn't 
result in the tail of the base URL being chopped off before the 
query or fragment is added/replaced. So this lends further 
credence to the notion that the intent was to be conformant.

Furthermore, in the one and only place where it is used in 
core python libs at all (in urllib.FancyURLOpener), basejoin() 
is used in a way that would give a wrong result, if left as-is: 
to derive an absolute URL so a redirect can be followed, when 
given a relative URL in the Location header of an HTTP 302 
response. Granted, the odds of this header being empty are 
slim, but still. It's a bug in FancyURLOpener.redirect_internal, 
at least.

Nevertheless, I am now of the opinion that anyone serious 
about conformant URL handling will avoid urllib. In 4Suite I 
have implemented wrappers and replacements for the 
functions we need. See also bug #649962.

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2004-03-23 13:37

Message:
Logged In: YES 
user_id=357491

I disagree with what you are expecting.  For instance, if you run 
``urllib.basejoin("http://python.org/index.html", "/doc")`` it returns 
"http://python.org/dev", which makes sense.  So changing its behavior 
based on it being an empty string would not strictly match how the 
function works when compared to being given any string.

And on top of things the function is not even documented, so you really 
shouldn't be expecting any specific behavior.

Closing this as invalid.

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2003-05-21 20:14

Message:
Logged In: YES 
user_id=357491

Perhaps urllib.basejoin (which is not documented) should just become a 
wrapper for urlparse.urljoin ?  It won't solve this bug but it would cut back 
on unneeded code.

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2002-11-26 02:41

Message:
Logged In: YES 
user_id=371366

I was partly mistaken; the document's URI is not necessarily 
the base. A reference with an empty path (e.g., an empty 
string or just a fragment identifier) is a reference to the current 
document, regardless of the base URI you are resolving 
against. A base URI is only for resolving relative URIs that are 
not referencing the current document. See some discussion 
at http://lists.w3.org/Archives/Public/uri/2002Jan/0015.html

So neither urllib.basejoin() nor urlparse.urljoin() fully 
implement the RFC 2396 "resolution to absolute form", since 
there would need to be a way to indicate "current document" 
other than returning the base.

Nevertheless, basejoin()'s behavior differs from urlparse.urljoin
()'s when presented with the empty string, and it's not clear 
whether that is intentional.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=639311&group_id=5470