URL parsing for the hard cases
nagle at animats.com
Mon Jul 23 06:59:41 CEST 2007
Here's another hard case. This one might be a bug in urlparse:
s = 'ftp://administrator:email@example.com/originals/6 june
(u'ftp', u'administrator:password at 220.127.116.11', u'/originals/6 june
07/ebay/login/ebayisapi.html', '', '', '')
That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.
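For comparison, here is a minimal sketch of how the standard library exposes these pieces in Python 3, where the module is `urllib.parse` rather than `urlparse`. The URL below is a made-up stand-in, not the one from the phishing list. The netloc field keeps the userinfo as written; the separated values are available as attributes on the result object.

```python
from urllib.parse import urlsplit

# Hypothetical URL for illustration only; host is from a documentation range.
url = "ftp://administrator:secret@192.0.2.10/originals/login.html"
parts = urlsplit(url)

print(parts.netloc)    # userinfo is still part of netloc
print(parts.username)  # user name split out of netloc
print(parts.password)  # password split out of netloc
print(parts.hostname)  # host with userinfo and port removed
print(parts.port)      # None, since this URL gives no port
```

So anything that needs the bare host should read `hostname` rather than index into the tuple.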
That's a real URL, from a search for phishing sites. There are lots
of hostile URLs out there, some of which can fool some parsers.
John Nagle wrote:
> memracom at yahoo.com wrote:
>> Once you eliminate IPv6 addresses, parsing is simple. Is there a
>> colon? Then there is a port number. Does the left over have any
>> characters not in [0123456789.]? Then it is a name, not an IPv4 address.
>> --Michael Dillon
> You wish. Hex input of IP addresses is allowed:
> are both "Python.org". Or just put
> into the address bar of a browser. All these work in Firefox on Windows
> and are recognized as valid IP addresses.
> On the other hand,
> is a valid domain name, in use by PairNIC.
> is handled by Firefox on Windows as a domain name. It doesn't resolve,
> but it's sent to DNS.
> So I think the question is whether every term between dots can be parsed as
> a decimal or hex number. If all terms can be parsed as a number, and there
> are no more than four of them, it's an IP address. Otherwise it's a domain
> name.
> There are phishing sites that pull stuff like this, and I'm parsing a
> long list of such sites. So I really do need to get the hard cases right.
> Is there any library function that correctly tests for an IP address vs. a
> domain name based on syntax, i.e. without looking it up in DNS?
> John Nagle
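The rule proposed in the quoted message can be sketched roughly as below. This is only an illustration of that heuristic, not a library function: the names `looks_like_ip_address` and `_is_numeric_term` are made up here, and treating "hex" as "has a 0x prefix" is an assumption. It does no range checking and does not claim to match what any particular browser accepts.

```python
import string

def _is_numeric_term(term):
    """True if term is all decimal digits, or 0x followed by hex digits."""
    if term == "":
        return False
    if all(c in string.digits for c in term):
        return True
    if term[:2].lower() == "0x":
        digits = term[2:]
        return digits != "" and all(c in string.hexdigits for c in digits)
    return False

def looks_like_ip_address(host):
    """Syntax-only test: at most four dot-separated terms, all numeric."""
    terms = host.split(".")
    if len(terms) > 4:
        return False
    return all(_is_numeric_term(t) for t in terms)

# Made-up sample inputs; none of these come from the phishing list.
for host in ["192.0.2.1", "0xC0.0x00.0x02.0x01", "0xC0000201",
             "example.com", "1.2.3.4.5", "face.b00c"]:
    print(host, looks_like_ip_address(host))
```

A term such as "face" contains only hex digits but has no 0x prefix, so under this assumption it counts as part of a domain name.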