URL parsing for the hard cases

John Nagle nagle at animats.com
Mon Jul 23 06:59:41 CEST 2007

Here's another hard case.  This one might be a bug in urlparse:

import urlparse

s = 'ftp://administrator:password@ june 



(u'ftp', u'administrator:password at', u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')

That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.
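
For comparison, here is a minimal sketch of how the netloc and the
username/password attributes relate. It assumes Python 3, where the
urlparse module lives at urllib.parse, and it uses a made-up host
(example.com) and path, since the real ones are not reproduced above.
Behavior may differ in older Python versions.

```python
from urllib.parse import urlsplit

# Hypothetical URL: the host and path are placeholders, not the real phishing URL.
url = 'ftp://administrator:password@example.com/originals/login.html'

parts = urlsplit(url)

# netloc holds everything between '//' and the next '/',
# including the user:password@ prefix.
print(parts.netloc)

# The individual pieces are exposed as attributes of the result object.
print(parts.username, parts.password, parts.hostname)
```

So in this sketch the userinfo stays inside netloc, and the separated
values have to be read from the attributes rather than from the tuple.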

That's a real URL, from a search for phishing sites.  There are lots
of hostile URLs out there, some of which can fool some parsers.

				John Nagle

John Nagle wrote:
> memracom at yahoo.com wrote:
>> Once you eliminate IPv6 addresses, parsing is simple. Is there a
>> colon? Then there is a port number. Does the left over have any
>> characters not in [0123456789.]? Then it is a name, not an IPv4
>> address.
>> --Michael Dillon
>   You wish.  Hex input of IP addresses is allowed:
>     http://0x525eedda
> and
>     http://0x52.0x5e.0xed.0xda
> are both "Python.org".  Or just put
>     0x52.0x5e.0xed.0xda
> into the address bar of a browser.  All these work in Firefox on Windows and
> are recognized as valid IP addresses.
> On the other hand,
>     0x52.com
> is a valid domain name, in use by PairNIC.
> But
>     http://test.0xda
> is handled by Firefox on Windows as a domain name.  It doesn't resolve, but
> it's sent to DNS.
> So I think the question is whether every term between dots can be parsed as
> a decimal or hex number.  If all terms can be parsed as a number, and there
> are no more than four of them, it's an IP address.  Otherwise it's a domain
> name.
> There are phishing sites that pull stuff like this, and I'm parsing a long
> list of such sites.  So I really do need to get the hard cases right.
> Is there any library function that correctly tests for an IP address vs. a
> domain name based on syntax, i.e. without looking it up in DNS?
>                 John Nagle
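
The rule proposed in the quoted message (every dot-separated term parses
as a decimal or hex number, and there are no more than four terms) can be
sketched as below. This is only an illustration of that rule, not a
library function; the names parse_term and looks_like_ip are made up
here, and the sketch makes no claim to match what any particular browser
does (it ignores octal, range checks, and IPv6, for instance).

```python
def parse_term(term):
    """Return the value of a decimal or 0x-prefixed hex term, or None."""
    if term[:2].lower() == "0x":
        digits, allowed, base = term[2:].lower(), "0123456789abcdef", 16
    else:
        digits, allowed, base = term, "0123456789", 10
    # Reject empty terms and anything that is not a plain digit string
    # (int() alone would also accept signs, underscores and whitespace).
    if not digits or any(c not in allowed for c in digits):
        return None
    return int(digits, base)


def looks_like_ip(host):
    """Apply the quoted rule: at most four terms, all of them numeric."""
    terms = host.split(".")
    if len(terms) > 4:
        return False
    return all(parse_term(t) is not None for t in terms)


# The hosts mentioned in the quoted message.
for host in ("0x525eedda", "0x52.0x5e.0xed.0xda", "0x52.com", "test.0xda"):
    print(host, looks_like_ip(host))
```

Under this rule the first two hosts are classified as IP addresses and
the last two as domain names, which matches the cases described in the
quoted message.  Note that a bare numeric label such as "1234" would also
be classified as an IP address.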
