[Web-SIG] urlparse method behaviour when handing abs/rel urls
orsenthil at gmail.com
Fri Jun 27 20:31:58 CEST 2008
At http://bugs.python.org/issue754016, there is a discussion about how a URL
given in the usual way to urlparse (e.g. urlparse('www.python.org')) is parsed
as a path rather than as the net_loc component, which is the common expectation.
The urlparse module tries to follow RFC 1808, where it is specified that:
2.4.3. Parsing the Network Location/Login
If the parse string begins with a double-slash "//", then the
substring of characters after the double-slash and up to, but not
including, the next slash "/" character is the network location/login
(<net_loc>) of the URL.
As for treating the URL as a path, the RFC specifies that after parsing the
scheme, net_loc, parameters and query, whatever is left is the path.
2.4.6. Parsing the Path
After the above steps, all that is left of the parse string is the
URL <path> and the slash "/" that may precede it.
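The two RFC rules quoted above can be seen directly in the parser's output. This is only a minimal sketch: the import fallback is an assumption so that the same lines run on both Python 2 (urlparse module) and Python 3 (urllib.parse), and the variable names are made up for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse      # Python 2 module discussed in this post

# No leading '//': nothing qualifies as net_loc, so everything is left as path.
no_slashes = urlparse('www.python.org')

# Leading '//': the text up to the next '/' is taken as net_loc.
with_slashes = urlparse('//www.python.org')

print(repr(no_slashes.netloc), repr(no_slashes.path))
print(repr(with_slashes.netloc), repr(with_slashes.path))
```

Without the double slash the host name ends up in .path and .netloc is empty; with it, the reverse holds.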
So, since 'www.python.org' is not a scheme, net_loc (as per the RFC), parameter
or query, it is a path. This looks absurd for 'www.python.org' but is perfect
for parsing relative URLs like just 'a'. Moreover, this makes sense when we
have relative URLs with parameters and query, e.g. 'g:h', '?x'.
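As a rough illustration of why the same rule suits relative URLs, the sketch below parses the three examples mentioned above. The import fallback and the results dictionary are assumptions added for the example, not part of the urlparse module.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse      # Python 2 module discussed in this post

# Parse each relative URL and keep the result for inspection.
results = {}
for url in ('a', '?x', 'g:h'):
    results[url] = urlparse(url)
    print(url, results[url])
```

Here 'a' comes back as a bare path, '?x' as a query with an empty path, and 'g:h' as scheme 'g' with path 'h'. None of them has a net_loc, which is what a relative reference needs.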
Now, the question is: how do we inform users that if they want the net_loc of
the URL, they have to put // in front of it?
My suggestion is through the "Docs" and "Help" message.
There is a discussion and a suggestion about raising an Exception for cases
where the URL does not start with '//'.
As the urlparse module is used for handling both absolute URLs and relative
URLs, this suggestion, IMHO, would break urlparse's handling of all relative
URLs. For example, the cases mentioned in RFC 1808 (Section 5.1 Normal
Another way to resolve this would be to break urlparse into two methods,
absparse() and relparse(), and let the user decide which one they want.
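To make the proposal concrete, here is one possible sketch of such a split. Neither function exists in the standard library; the names come from the question below, and the behaviour (raising ValueError when no net_loc is found) is only my assumption of what an absolute-only parser might do.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse      # Python 2 module discussed in this post


def relparse(url):
    """Hypothetical: parse as today, so relative URLs keep working."""
    return urlparse(url)


def absparse(url):
    """Hypothetical: parse, but insist that a net_loc was found."""
    parts = urlparse(url)
    if not parts.netloc:
        # Assumed error behaviour; the post does not specify one.
        raise ValueError(
            "no network location in %r; did you mean '//%s'?" % (url, url))
    return parts
```

With this sketch, absparse('//www.python.org') succeeds and absparse('www.python.org') raises, while relparse('a') still returns 'a' as a path.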
Please provide your suggestions on this.
- Is the current method okay?
- Do we feel the need for absparse() and relparse()?