urlparse incomplete ?

Dave Brueck dave at pythonapocrypha.com
Thu Oct 31 12:20:02 EST 2002


On Thu, 31 Oct 2002, maxm wrote:

> I am writing code where I need to convert relative urls to absolute 
> urls. For this I use the urlparse module.
> 
> First off, urlparse doesn't take care of urls where there is a user and 
> password. Shouldn't that be corrected?

Define "doesn't take care of". :) It doesn't error out - it just returns
the user name and password as part of the netloc. In that sense you could
also say it doesn't take care of URLs where the port is included. Since
URLs with username/password aren't the common case, maybe the thing to add
would be a separate function in that module that, say, takes the output
from urlparse and returns (username, password, host, port) or something?  
To me that would be really useful because I always end up doing stuff
like:

import urlparse
parts = urlparse.urlparse(url)
# parts[1] is the netloc, "host[:port]"; fall back to port 80 if none is given
host, port = (parts[1].split(':') + ['80'])[:2]

so that I can use them in connecting a socket.

:(
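
A rough sketch of the kind of helper I have in mind, for illustration only.
The name split_netloc and its default_port parameter are made up here and
are not part of the urlparse module. It assumes the netloc looks like
"user:password@host:port" with any of the optional pieces missing, and it
does not try to handle bracketed IPv6 literals:

```python
try:
    from urlparse import urlparse      # module name used in this post
except ImportError:
    from urllib.parse import urlparse  # same function in later Pythons


def split_netloc(netloc, default_port=80):
    """Hypothetical helper: return (username, password, host, port).

    Missing username/password come back as None; a missing port falls
    back to default_port. Bracketed IPv6 hosts are not handled.
    """
    username = password = None
    hostport = netloc
    if '@' in netloc:
        # Everything before the last '@' is the user info
        userinfo, hostport = netloc.rsplit('@', 1)
        if ':' in userinfo:
            username, password = userinfo.split(':', 1)
        else:
            username = userinfo
    if ':' in hostport:
        host, port = hostport.split(':', 1)
        port = int(port) if port else default_port
    else:
        host, port = hostport, default_port
    return username, password, host, port


parts = urlparse('http://joe:secret@www.example.com:8080/index.html')
print(split_netloc(parts[1]))
```

Unlike the one-liner above, this returns the port as an integer, which is
what a socket connect call wants anyway.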

> Also I have seen absolute urls of the form "//www.wired.com/path". The 
> ambiguity here being caused by the "//" instead of "http://".
> 
> The rfc is weak in describing if this is a legal url

Is it? The BNF of URLs (section 5 of RFC 1738) goes a little like this:

genericurl = scheme ":" schemepart
url = httpurl | ... | genericurl
scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
httpurl = "http://" ...

Maybe I'm reading it wrong, but it appears that the scheme is required. 
The browser is just helping lazy/confused users - just like how many 
browsers will accept "www.python.org" and know you meant 
"http://www.python.org/".

> but the browser handles them well and just sets the default scheme to
> http. So I have to handle it.

Since the browser is, at its core, an HTTP application, it is a reasonable
bet to assume HTTP. The same does not hold true for the urlparse module
though, as you pointed out.

> Naturally urlparse cannot know that the scheme is http, but if this is 
> indeed a legal url, shouldn't it be possible to set the default scheme 
> when calling urlparse?

IMO, no, but that's just me. :) Since the spec seems to indicate that 
those aren't valid URLs, I don't think urlparse should do any more than it 
currently does: return '' for the scheme. 
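
For what it's worth, here is a small sketch of what the module does with
such a URL, in case it helps with the relative-to-absolute conversion. The
base URL below is just a made-up example. Note that urlparse does accept an
optional second argument giving a default scheme, and urljoin borrows the
scheme from the base URL when the second URL starts with "//":

```python
try:
    from urlparse import urlparse, urljoin      # module name used in this post
except ImportError:
    from urllib.parse import urlparse, urljoin  # same functions in later Pythons

url = '//www.wired.com/path'

# With no default, the scheme comes back empty and the host lands in netloc
no_default = urlparse(url)

# The optional second argument is used only when the URL has no scheme
with_default = urlparse(url, 'http')

# Resolving against a base URL (a made-up one here) fills in its scheme
absolute = urljoin('http://www.example.com/index.html', url)

print(repr(no_default[0]))
print(repr(with_default[0]))
print(absolute)
```

So the caller, who knows which page the link came from, can supply the
scheme without urlparse having to guess.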

-Dave




