Extracting real-domain-name (without sub-domains) from a given URL
steve at holdenweb.com
Tue Jan 13 12:43:37 CET 2009
S.Selvam Siva wrote:
> On Tue, Jan 13, 2009 at 1:50 PM, Chris Rebert <clp2 at rebertia.com> wrote:
>> On Mon, Jan 12, 2009 at 11:46 PM, S.Selvam Siva <s.selvamsiva at gmail.com> wrote:
>>> Hi all,
>>> I need to extract the domain name from a given URL (without sub-domains).
>>> With urlparse, I am able to fetch only the full host name (which still
>>> includes the sub-domain).
>>> http://feeds.huffingtonpost.com/posts/ , http://www.huffingtonpost.de/,
>>> .... all must lead to huffingtonpost.com or huffingtonpost.de
>>> Please suggest some ideas for this problem.
>> That would require (pardon the pun) domain-specific logic. For most
>> TLDs (e.g. .com, .org) the domain name is just blah.com, blah.org,
>> etc. But for ccTLDs, often only second-level registrations are
>> allowed, e.g. for www.bbc.co.uk, so the main domain name would be
>> bbc.co.uk. I think a few TLDs have even more complicated rules.
>> I doubt anyone's created a general ready-made solution for this; you'd
>> have to code it yourself.
>> To handle the common case, you can cheat and just .split() at the
>> periods and then slice and rejoin the list of domain parts, ex:
> Thank you Chris Rebert,
> Actually, I tried domain-specific logic: I kept a list of 200 TLDs like
> .com, co.in and co.uk, and tried to extract the domain name with it.
> But my boss wants a more reliable solution than this method. Anyway, I
> will try to find some alternative solution.
If you post a good first try and open the source, I would be surprised
if others do not join your effort to establish suitable rules. This is
something that many people could doubtless use.
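For anyone picking this up, here is a minimal sketch of the approach
discussed above: take the host name, split it at the periods, and keep the
last two labels, unless the last two labels match an entry in a table of
multi-label suffixes, in which case keep three. It is written for Python 3,
where the urlparse functions live in urllib.parse. The function name
registered_domain and the two-entry MULTI_LABEL_SUFFIXES table are made up
for illustration; they are not an existing library, and a usable table
would need far more entries.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny table of multi-label suffixes under which
# names are registered. A real table would need many more entries.
MULTI_LABEL_SUFFIXES = {"co.uk", "co.in"}


def registered_domain(url):
    """Return the host name without sub-domains, or None if the URL has no host."""
    host = urlsplit(url).hostname  # lower-cased, without port or credentials
    if not host:
        return None
    labels = host.rstrip(".").split(".")
    # Keep three labels when the last two form a known suffix, otherwise two.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


if __name__ == "__main__":
    for url in ("http://feeds.huffingtonpost.com/posts/",
                "http://www.huffingtonpost.de/",
                "http://www.bbc.co.uk/"):
        print(url, "->", registered_domain(url))
```

With this table, the URLs from the original question reduce to
huffingtonpost.com and huffingtonpost.de, and www.bbc.co.uk reduces to
bbc.co.uk. Numeric IP addresses and suffixes missing from the table are not
handled, which is exactly where shared rules would help.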
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/