Sort by domain name?

Tim Chase python.list at tim.thechases.com
Mon Oct 2 17:45:23 CEST 2006


>> Here, domain name doesn't contain subdomain, or should I
>> say, domain's part of 'www', mail, news and en should be 
>> excluded.
> 
> It's a little more complicated, you have to treat co.uk about
>  the same way as .com, and similarly for some other countries 
> but not all.  For example, subdomain.companyname.de versus 
> subdomain.companyname.com.au or subdomain.companyname.co.uk. 
> You end up needing a table or special code to say how to treat
> various countries.
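
The table idea could be sketched roughly as below. This is only a
minimal illustration: TWO_LABEL_SUFFIXES and registered_domain are
names I made up, the table is hand-written and far from complete,
and it expects bare hostnames with no scheme in front.

```python
# Hypothetical, hand-maintained table of two-label suffixes that should
# be treated like a single TLD. Deliberately incomplete.
TWO_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def registered_domain(host):
    """Return (name, suffix) for a bare hostname like 'mail.yahoo.co.uk'."""
    labels = host.lower().split(".")
    # The suffix is two labels long if the last two are in the table,
    # otherwise just the final label.
    n = 2 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 1
    suffix = ".".join(labels[-n:])
    # The label just left of the suffix is the name; empty if none exists.
    name = labels[-n - 1] if len(labels) > n else ""
    return name, suffix

hosts = ["mail.yahoo.co.uk", "subdomain.companyname.de",
         "subdomain.companyname.com.au", "mail.google.com"]
for host in hosts:
    print(host, registered_domain(host))
```

Passing registered_domain as the key to sorted() then groups hosts
by name first and suffix second, ignoring www, mail and friends.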

In addition, the same "base" domain name can point at entirely
different sites depending on the TLD, such as "whitehouse" under
".gov" versus ".com". Thus, I'm not sure there's any way to
distinguish that case from "yahoo.com" vs. "yahoo.co.uk" (which
presumably should sort together) without doing a boatload of
WHOIS queries, which in turn might be misleading anyway.

A first-pass solution might look something like:

##############################################################
 >>> sites
['http://mail.google.com', 'http://reader.google.com', 
'http://mail.yahoo.co.uk', 'http://google.com', 
'http://mail.yahoo.com']
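 >>> # lstrip('http://') strips a set of characters, not a prefix,
 >>> # so slice the scheme off instead (assumes every URL has it)
 >>> sitebits = [site.lower()[len('http://'):].split('.')
...             for site in sites]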
 >>> for site in sitebits: site.reverse()
...
 >>> sorted(sitebits)
[['com', 'google'], ['com', 'google', 'mail'],
['com', 'google', 'reader'], ['com', 'yahoo', 'mail'],
['uk', 'co', 'yahoo', 'mail']]
 >>> results = ['http://' + '.'.join(reversed(site))
...            for site in sorted(sitebits)]
 >>> results
['http://google.com', 'http://mail.google.com', 
'http://reader.google.com', 'http://mail.yahoo.com', 
'http://mail.yahoo.co.uk']
##############################################################

which can be wrapped up like this:

##############################################################
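 >>> def sort_by_domain(sites):
...     # assumes every URL starts with 'http://'; slice rather than
...     # lstrip(), which strips a character set, not a prefix
...     prefix = 'http://'
...     sitebits = [site.lower()[len(prefix):].split('.')
...                 for site in sites]
...     for site in sitebits: site.reverse()
...     return [prefix + '.'.join(reversed(site))
...             for site in sorted(sitebits)]
...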
 >>> sort_by_domain(sites)
['http://google.com', 'http://mail.google.com', 
'http://reader.google.com', 'http://mail.yahoo.com', 
'http://mail.yahoo.co.uk']
##############################################################

to give you a sorting function.  It assumes every URL uses http
rather than a mix of URL types such as ftp or mailto.  Those are
easy enough to strip off as well, but putting them back on
afterwards takes a little more work.
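
One way around the putting-back-on problem is to never take the
scheme off the original string, and sort with a key function
instead. A rough sketch, assuming Python 3's urllib.parse;
domain_key is a name I made up, and the mailto handling simply
takes whatever follows the last '@':

```python
from urllib.parse import urlsplit

def domain_key(url):
    """Sort key: host labels reversed, e.g. ['com', 'google', 'mail']."""
    parts = urlsplit(url)
    # scheme://host/... URLs expose the host directly; for
    # mailto:user@host the address lands in .path instead.
    host = parts.hostname or parts.path.rpartition("@")[2]
    return host.lower().split(".")[::-1]

mixed = ["http://mail.google.com", "ftp://ftp.google.com/pub",
         "mailto:someone@yahoo.co.uk", "http://google.com"]
print(sorted(mixed, key=domain_key))
```

Since sorted() only uses the key for comparisons, the original
URLs come back untouched, scheme and path included.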

Just a few ideas,

-tkc
