urlparse.urlparse bug - misparses long URL
John Nagle
nagle at animats.com
Fri Dec 14 03:14:56 EST 2007
Matt Nordhoff wrote:
> John Nagle wrote:
>> Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
>> ====
...
>
> It's breaking on the first slash, which just happens to be very late in
> the URL.
>
>>>> urlparse('http://example.com?blahblah=http://example.net')
> ('http', 'example.com?blahblah=http:', '//example.net', '', '', '')
That's what it seems to be doing:
sa1 = 'http://example.com?blahblah=/foo'
sa2 = 'http://example.com?blahblah=foo'
print urlparse.urlparse(sa1)
('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
print urlparse.urlparse(sa2)
('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT
That's wrong. RFC 3986 ("Uniform Resource Identifier (URI): Generic Syntax"),
page 23 says
"The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators."
So "urlparse" is an "older, erroneous implementation". Looking
at the code for "urlparse", it references RFC1808 (1995), which
was a long time ago, three revisions back.
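For comparison, RFC 3986 Appendix B gives a regular expression for breaking a URI reference into its components. Below is a minimal sketch of applying it to the URL above. The helper name rfc3986_split is made up for illustration and is not part of "urlparse".

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def rfc3986_split(uri):
    """Hypothetical helper: return (scheme, authority, path, query, fragment)."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

print(rfc3986_split('http://example.com?blahblah=/foo'))
```

With this pattern the authority stops at the "?", so the slash stays inside the query instead of starting a path.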
Here's the bad code:
def _splitnetloc(url, start=0):
    for c in '/?#': # the order is important!
        delim = url.find(c, start)
        if delim >= 0:
            break
    else:
        delim = len(url)
    return url[start:delim], url[delim:]
That's just wrong. The domain ends at the first appearance of
any character in '/?#', but that code returns the text before the
first '/' even if there's an earlier '?'. A URL/URI doesn't
have to have a path, even when it has query parameters.
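One possible correction, as a sketch only: take the earliest position of any of the three delimiters rather than stopping at the first delimiter character that occurs anywhere. The name splitnetloc_fixed is hypothetical, and the start=2 in the usage line assumes the caller has already stripped the scheme and is skipping the leading "//".

```python
def splitnetloc_fixed(url, start=0):
    """Split url at the earliest '/', '?' or '#' found at or after start."""
    delim = len(url)                  # default: netloc runs to the end
    for c in '/?#':                   # order no longer matters
        pos = url.find(c, start)
        if pos >= 0:
            delim = min(delim, pos)   # keep the earliest delimiter seen
    return url[start:delim], url[delim:]

# Scheme already removed; start=2 skips the leading '//'.
print(splitnetloc_fixed('//example.com?blahblah=/foo', 2))
```

Here the netloc ends at the "?" and the rest, including the slash, is left for the query parser.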
This bug is in Python 2.4 and 2.5. I'll file a bug report.
John Nagle
SiteTruth