[ python-Bugs-548176 ] urlparse doesn't handle host?bla
SourceForge.net
noreply at sourceforge.net
Mon Jan 26 20:13:02 EST 2004
Bugs item #548176, was opened at 2002-04-24 08:36
Message generated for change (Comment added) made by mrovner
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=548176&group_id=5470
Category: Python Library
Group: Python 2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Markus Demleitner (msdemlei)
Assigned to: Nobody/Anonymous (nobody)
Summary: urlparse doesn't handle host?bla
Initial Comment:
The urlparse module (at least in 2.2 and 2.1, Linux) doesn't handle
URLs of the form http://www.maerkischeallgemeine.de?loc_id=49
correctly: everything up to and including the 9 ends up in the host.
I didn't check the RFC, but in the real world URLs like this do show
up. urlparse works fine when there's a trailing slash on the host
name:
http://www.maerkischeallgemeine.de/?loc_id=49
Example:
<pre>
>>> import urlparse
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
('http', 'www.maerkischeallgemeine.de', '/', '', 'loc_id=49', '')
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de?loc_id=49")
('http', 'www.maerkischeallgemeine.de?loc_id=49', '', '', '', '')
</pre>
This has serious implications for urllib, since
urllib.urlopen will fail for URLs like the second one,
and with a pretty mysterious exception ("host not
found") at that.
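Until urlparse handles this case, a caller could normalize such URLs
before parsing them. What follows is only a sketch of that idea, not
part of urlparse or of any submitted patch. The name add_missing_slash
is made up for illustration, and the code assumes the URL has the
"scheme://" form.

```python
def add_missing_slash(url):
    """Insert '/' when a '?' directly follows the host part of a URL.

    Hypothetical helper: URLs without "://" or without a '?' are
    returned unchanged.
    """
    marker = "://"
    start = url.find(marker)
    if start == -1:
        return url  # no host part to repair
    rest_start = start + len(marker)
    question = url.find("?", rest_start)
    if question == -1:
        return url  # no query at all
    # If a '/' or '#' comes before the '?', the host part has already
    # ended there and the '?' belongs to the path or the fragment.
    for other in "/#":
        pos = url.find(other, rest_start)
        if pos != -1 and pos < question:
            return url
    return url[:question] + "/" + url[question:]


if __name__ == "__main__":
    print(add_missing_slash("http://www.maerkischeallgemeine.de?loc_id=49"))
```

The result can then be passed to urlparse or urllib.urlopen, which
already handle the form with the trailing slash.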
----------------------------------------------------------------------
Comment By: Mike Rovner (mrovner)
Date: 2004-01-26 17:13
Message:
Logged In: YES
user_id=162094
According to RFC2396 (ftp://ftp.isi.edu/in-notes/rfc2396.txt),
section 3 (URI Syntactic Components), an absoluteURI can take the form:
"""
<scheme>://<authority><path>?<query>
each of which, except <scheme>, may be absent from a
particular URI.
"""
Later on (3.2):
"""
The authority component is preceded by a double slash "//"
and is terminated by the next slash "/", question-mark "?",
or by the end of the URI.
"""
So the URL "http://server?query" is perfectly legal and should be
allowed, and patch 712317 should be rejected.
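The termination rule quoted from section 3.2 can be sketched in a few
lines. This is an illustration of the rule only, not the urlparse
implementation; split_authority is a hypothetical name, and fragments
("#") are deliberately not handled because the quoted text does not
mention them.

```python
def split_authority(uri):
    """Split a URI into (scheme, authority, remainder).

    The authority starts after "//" and ends at the next "/", the next
    "?", or the end of the URI, whichever comes first.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError("no '//' authority marker in %r" % (uri,))
    # Collect the positions of the delimiters that actually occur.
    ends = [pos for pos in (rest.find("/"), rest.find("?")) if pos != -1]
    end = min(ends) if ends else len(rest)
    return scheme, rest[:end], rest[end:]


if __name__ == "__main__":
    print(split_authority("http://server?query"))
```

Under this rule "http://server?query" yields the authority "server"
and leaves "?query" for the query component.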
----------------------------------------------------------------------
Comment By: Steven Taschuk (staschuk)
Date: 2003-03-30 12:19
Message:
Logged In: YES
user_id=666873
For comparison, RFC 1738 section 3.3:
    An HTTP URL takes the form:
        http://<host>:<port>/<path>?<searchpart>
    [...] If neither <path> nor <searchpart> is present,
    the "/" may also be omitted.
... which does not outright say the '/' may *not* be omitted if
<path> is absent but <searchpart> is present (though imho
that's implied).
But even if the / may not be omitted in this case, ? is not
allowed in the authority component under either RFC 2396 or
RFC 1738, so urlparse should either treat it as a delimiter or
reject the URL as malformed. The principle of being lenient in
what you accept favours the former.
I've just submitted a patch (712317) for this.
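The two options can be put side by side in a small sketch. This is not
the code from patch 712317; split_host and its strict flag are
hypothetical, and the regular expression is a simplified assumption
about what a scheme and host look like.

```python
import re

# scheme "://" host-part, then whatever follows; the host part stops at
# the first "/" or "?".
_URL = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*)://([^/?]*)(.*)", re.S)


def split_host(url, strict=False):
    """Return (scheme, host, rest) for a URL of the form scheme://host...

    Lenient mode (the default) treats '?' as ending the host part.
    Strict mode rejects a '?' that follows the host without a '/'.
    """
    match = _URL.fullmatch(url)
    if match is None:
        raise ValueError("not a scheme://host URL: %r" % (url,))
    scheme, host, rest = match.groups()
    if strict and rest.startswith("?"):
        raise ValueError("'/' missing before '?' in %r" % (url,))
    return scheme, host, rest


if __name__ == "__main__":
    print(split_host("http://host?x=1"))
```

Either way the "?" never ends up inside the host, which is the
behaviour the report complains about.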
----------------------------------------------------------------------
Comment By: Jeff Epler (jepler)
Date: 2002-11-17 08:56
Message:
Logged In: YES
user_id=2772
This actually appears to be permitted by RFC2396
[http://www.ietf.org/rfc/rfc2396.txt]. See section 3.2:
    3.2. Authority Component

    Many URI schemes include a top hierarchical element for a
    naming authority, such that the namespace defined by the
    remainder of the URI is governed by that authority. This
    authority component is typically defined by an
    Internet-based server or a scheme-specific registry of
    naming authorities.

        authority = server | reg_name

    The authority component is preceded by a double slash
    "//" and is terminated by the next slash "/", question-mark
    "?", or by the end of the URI. Within the authority
    component, the characters ";", ":", "@", "?", and "/" are
    reserved.
----------------------------------------------------------------------