[New-bugs-announce] [issue19451] urlparse accepts invalid hostnames

Daniele Sluijters report at bugs.python.org
Wed Oct 30 14:16:02 CET 2013


New submission from Daniele Sluijters:

Python 2's urlparse.urlparse() and Python 3's urllib.parse.urlparse() accept URIs/URLs with underscores in the host/domain/subdomain. I believe this behaviour is incorrect.
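
For illustration, a minimal sketch of the behaviour described above, using Python 3. The URL is a made-up example, not one taken from a real report:

```python
from urllib.parse import urlparse

# Hypothetical URL whose first host label contains an underscore.
result = urlparse("http://sub_domain.example.com/index.html")

# The parser splits the URL without validating the host, so the
# underscore is passed through unchanged.
print(result.netloc)
print(result.hostname)
```

On Python 2 the equivalent import would be `from urlparse import urlparse`.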

A distinction needs to be made between DNS names and Uniform Resource Locators and Identifiers. urlparse is supposed to deal with the latter (correct me if I'm wrong).

According to RFC 2181 section 11, on the syntax of DNS names, the underscore is allowed, and it is in use around the internet, especially in TXT and SRV records.

However, RFC 1738 on Uniform Resource Locators, section 3.1 (and its updates), always defines the 'hostname' part of the URL as follows:
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters.

On top of that, RFC 2396 on URIs, section 3.2.2, states:
Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters.  

The underscore is never mentioned as a valid character, nor is it in any of the references in the RFCs, as far as I've been able to see.
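
As a rough sketch, the label rule quoted above could be checked like this. The helper name `is_rfc_hostname` is hypothetical and not part of any library, and the check covers only the quoted sentence (it ignores length limits, IP addresses and any other grammar details from the RFCs):

```python
import re

# One domain label: starts and ends with an alphanumeric character
# and may contain "-" characters in between.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_rfc_hostname(host):
    """Hypothetical helper: True if every "."-separated label matches the quoted rule."""
    return all(_LABEL.fullmatch(label) is not None for label in host.split("."))


print(is_rfc_hostname("en.wikipedia.org"))        # True
print(is_rfc_hostname("sub_domain.example.com"))  # False
```

A check like this would have to be applied to the host part only, not to the whole URL.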

Language implementations vary:
 * Ruby's URI.parse does not allow underscores in domain labels.
 * Perl's URI and URI::URL allow underscores.
 * java.net.URI treats the underscore as an illegal character in the domain part.
 * org.apache.http.HttpHost, since 4.2.3, treats the underscore as an illegal character in the domain part.

HTTP servers:
 * Apache: seems to tolerate underscores, but there has been a whole discussion about this on the mailing lists.
 * nginx: matches a server_name of '_' to 'any invalid domain name'. It seems to accept server_names with underscores in them, but the exact behaviour is currently unknown to me.

Browsers:
 * Since IE 5.5, IE cannot write cookies if the host or subdomain part includes an underscore.
 * Just about every other browser is fine with it.

Please note that I'm only talking about the host/domain/subdomain part of URIs and URLs. Something like http://en.wikipedia.org/wiki/12-hour_clock is perfectly valid and should parse.
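
To make the distinction concrete, a small sketch (Python 3) showing that the underscore in this URL sits in the path and not in the host, so a host-only check would leave it alone:

```python
from urllib.parse import urlparse

parts = urlparse("http://en.wikipedia.org/wiki/12-hour_clock")

# The host contains no underscore; only the path does.
print(parts.hostname)  # en.wikipedia.org
print(parts.path)      # /wiki/12-hour_clock
```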

----------
components: Library (Lib)
messages: 201730
nosy: daenney, orsenthil
priority: normal
severity: normal
status: open
title: urlparse accepts invalid hostnames
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 3.4, Python 3.5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue19451>
_______________________________________

