Splitting URLs

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sun Oct 21 16:37:55 CEST 2007


I'm trying to split a URL into components. For example:

URL = 'http://steve:secret@www.domain.com.au:82/dir' + \
    'ectory/file.html;params?query#fragment'


(Joining the strings above with plus has no significance; it's just to 
avoid word-wrapping.)

If I split the URL, I would like to get the following components:

scheme = 'http'
netloc = 'steve:secret@www.domain.com.au:82'
username = 'steve'
password = 'secret'
hostname = 'www.domain.com.au'
port = 82
path = '/directory/file.html'
parameters = 'params'
query = 'query'
fragment = 'fragment'

I can get *most* of the way with urlparse.urlparse: it will split the URL 
into a tuple:

('http', 'steve:secret@www.domain.com.au:82', '/directory/file.html', 
'params', 'query', 'fragment')
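
For concreteness, the call that produces that tuple can be sketched as below. The try/except import is an assumption added so the snippet also runs on Python 3, where the function lives in urllib.parse; on 2.4 only the first import is used.

```python
try:
    from urlparse import urlparse      # Python 2.x
except ImportError:
    from urllib.parse import urlparse  # Python 3 location (assumption, not in the post)

URL = ('http://steve:secret@www.domain.com.au:82/dir'
       'ectory/file.html;params?query#fragment')

# tuple() works whether urlparse returns a plain tuple (2.4) or a
# tuple subclass with named attributes (2.5 and later).
parts = tuple(urlparse(URL))
scheme, netloc, path, parameters, query, fragment = parts
```

Unpacking into six names keeps the code independent of the named attributes that only exist from 2.5 onwards.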

If I'm using Python 2.5, I can split the netloc field further with named 
attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have 
to support 2.4). Before I write code to split the netloc field by hand (a 
nuisance, but doable) I thought I'd ask if there was a function somewhere 
in the standard library I had missed.
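
The by-hand version might look something like the sketch below. The function name split_netloc is hypothetical, and it only uses string methods available in 2.4 (split and rsplit, no partition). It assumes a well-formed netloc and leaves the host untouched if the text after the last colon is not all digits.

```python
def split_netloc(netloc):
    """Split 'user:password@host:port' into (username, password, hostname, port).

    Hypothetical helper; parts that are absent come back as None.
    """
    username = password = port = None
    hostport = netloc
    if '@' in netloc:
        # The last '@' separates the user info from the host.
        userinfo, hostport = netloc.rsplit('@', 1)
        if ':' in userinfo:
            # The first ':' separates username from password.
            username, password = userinfo.split(':', 1)
        else:
            username = userinfo
    hostname = hostport
    if ':' in hostport:
        # Only treat the text after the last ':' as a port if it is numeric.
        host, maybe_port = hostport.rsplit(':', 1)
        if maybe_port.isdigit():
            hostname, port = host, int(maybe_port)
    return username, password, hostname, port
```

Called on the netloc from the example URL, this should give the username, password, hostname and integer port listed at the top of the post.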

This second question isn't specifically Python related, but I'm asking it 
anyway...

I'd also like to split the domain part of an HTTP netloc into top level 
domain (.au), second level (.com), etc. I don't need to validate the TLD, 
I just need to split it. Is splitting on dots sufficient, or will that 
miss some odd corner case of the HTTP specification?

(If it does, I might decide to live with the lack... it depends on how 
odd the corner is, and how much work it takes to fix.)
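
A naive dot split could be sketched as follows. The name split_domain is hypothetical. The sketch assumes the hostname is a DNS name rather than an IP address literal (a dotted quad would split into meaningless pieces), and it strips a trailing dot and lowercases first, since host names are case-insensitive and a fully qualified name may end in a dot.

```python
def split_domain(hostname):
    """Return the labels of a DNS hostname, top level domain first.

    Hypothetical helper; does not validate the TLD or detect IP addresses.
    """
    # Drop the optional trailing dot of a fully qualified name,
    # normalise case, then split into labels.
    labels = hostname.rstrip('.').lower().split('.')
    labels.reverse()   # top level domain first
    return labels
```

With the example hostname this puts 'au' at index 0, 'com' at index 1, and so on.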



-- 
Steven.


