[Python-Dev] [ssl] The weird case of IDNA

Christian Heimes christian at python.org
Fri Dec 29 15:54:46 EST 2017


This mail is about internationalized domain names and TLS/SSL. It
doesn't concern you if you live in ASCII-land. A couple of other
developers and I would like to change the ssl module in a
backwards-incompatible way to fix IDN support for TLS/SSL.

Simply speaking, the IDNA standards (Internationalized Domain Names for
Applications) describe how to encode non-ASCII domain names. The DNS
system and X.509 certificates cannot handle non-ASCII host names, so any
non-ASCII part of a hostname is Punycode-encoded. For example, the host
name 'www.bücher.de' (books) is translated into 'www.xn--bcher-kva.de'.
In IDNA terms, 'www.bücher.de' is called an IDN U-label (Unicode) and
'www.xn--bcher-kva.de' an IDN A-label (ASCII). Please refer to the TR46
document [1] for more information.
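
As a quick illustration, CPython's built-in 'idna' codec can do this
conversion in both directions. This is only a sketch; the variable names
are mine and not part of any API:

```python
# Convert between the U-label and A-label forms with the stdlib codec.
u_label = "www.bücher.de"

# Encoding yields ASCII bytes; each non-ASCII label gets the 'xn--' prefix.
a_label = u_label.encode("idna")
print(a_label)

# Decoding an A-label gives the Unicode form back.
print(a_label.decode("idna"))
```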

In a perfect world, it would be very simple: we'd have only one IDNA
standard. However, there are multiple standards that are incompatible
with each other. The German TLD .de demands IDNA-2008 with UTS#46
compatibility mapping. The hostname 'www.straße.de' maps to
'www.xn--strae-oqa.de'. However in the older IDNA 2003 standard,
'www.straße.de' maps to 'www.strasse.de', but 'strasse.de' is a totally
different domain!

CPython only supports IDNA 2003.
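
The difference can be reproduced with the standard library alone. The
sketch below is not an IDNA 2008 implementation: it only runs the raw
'punycode' codec on a single label and adds the 'xn--' prefix by hand,
skipping all validation and mapping. The variable names are
illustrative.

```python
# IDNA 2003 (stdlib 'idna' codec): nameprep folds 'ß' to 'ss' first.
idna2003 = "www.straße.de".encode("idna")

# Hand-built A-label that keeps 'ß', which is what IDNA 2008 produces.
# Assumption: only the Punycode step is done here, nothing else.
label = "straße".encode("punycode")
idna2008_like = b"www.xn--" + label + b".de"

print(idna2003)
print(idna2008_like)
```

For real use, the 'idna' package from PyPI is the more appropriate
tool.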

It's less of an issue for the socket module. It only converts text to
IDNA bytes on the way in. All functions support bytes and text. Since
IDNA encoding does not change ASCII and IDNA-encoded data is ASCII, it
is also no problem to pass IDNA 2008-encoded text or bytes to all socket
functions:


>>> import socket
>>> import idna  # from PyPI
>>> names = ['straße.de', b'strasse.de', idna.encode('straße.de'),
...          idna.encode('straße.de').decode('ascii')]
>>> for name in names:
...     print(name, socket.getaddrinfo(name, None, socket.AF_INET,
...           socket.SOCK_STREAM, 0, socket.AI_CANONNAME)[0][3:5])
...
straße.de ('strasse.de', ('', 0))
b'strasse.de' ('strasse.de', ('', 0))
b'xn--strae-oqa.de' ('xn--strae-oqa.de', ('', 0))
xn--strae-oqa.de ('xn--strae-oqa.de', ('', 0))

As you can see, 'straße.de' is canonicalized as 'strasse.de'. The IDNA
2008 encoded hostname maps to a different IP address.

On the other hand, the ssl module is currently completely broken. It
converts hostnames from bytes to text with the 'idna' codec in some
places, but not in all. The SSLSocket.server_hostname attribute and the
name passed to the SSLContext.set_servername_callback() callback are
decoded as U-labels. A certificate's common name and subject alternative
name fields are not decoded and are therefore A-labels. They *must* stay
A-labels because hostname verification is only defined in terms of
A-labels. We even had a security issue once, because a partial wildcard
like 'xn*.example.org' must not match IDN hosts like
'xn--bcher-kva.example.org'.
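
The wildcard rule can be sketched as follows. This is a simplified,
hypothetical helper for a single left-most label, written to illustrate
the rule above; it is not the ssl module's actual matching code, and
`label_matches` is a name I made up.

```python
def label_matches(pattern: str, label: str) -> bool:
    """Match one left-most DNS label against a certificate pattern label."""
    pattern, label = pattern.lower(), label.lower()
    if "*" not in pattern:
        return pattern == label
    if pattern.count("*") > 1:
        return False
    # A partial wildcard must never match (or be part of) an IDN A-label.
    is_idn = pattern.startswith("xn--") or label.startswith("xn--")
    if pattern != "*" and is_idn:
        return False
    prefix, _, suffix = pattern.partition("*")
    return (len(label) >= len(prefix) + len(suffix)
            and label.startswith(prefix)
            and label.endswith(suffix))


print(label_matches("xn*", "xn--bcher-kva"))  # partial wildcard vs. IDN
print(label_matches("w*", "www"))             # partial wildcard vs. ASCII
```

Comparing A-labels keeps this check a plain ASCII string operation,
which is why the certificate fields should not be decoded.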

In issue [2] and PR [3], we all agreed that the only sensible fix is to
make 'SSLSocket.server_hostname' an ASCII text A-label. But this is a
backwards-incompatible fix. On the other hand, IDNA is totally broken
without the fix. Also, in my opinion, PR [3] does not go far enough.
Since we have to break backwards compatibility anyway, I'd like to
modify SSLContext.set_servername_callback() at the same time.
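
To make the change concrete, here is a small sketch that contrasts the
two text forms of the same name using only stdlib codecs. It does not
call the ssl module; `wire_name`, `current_form` and `proposed_form` are
illustrative names.

```python
# The hostname as it appears on the wire: always an ASCII A-label.
wire_name = b"www.xn--bcher-kva.de"

# Today: decoded with the 'idna' codec, so user code sees a U-label.
current_form = wire_name.decode("idna")

# Proposed: keep the A-label and expose it as ASCII text.
proposed_form = wire_name.decode("ascii")

print(current_form)
print(proposed_form)
```

The proposed form compares directly against the A-labels stored in
certificates, without going through any IDNA codec.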

- Is everybody OK with breaking backwards compatibility? The risk is
small. ASCII-only domains are not affected and IDNA users are broken anyway.
- Should I only fix 3.7 or should we consider a backport to 3.6, too?


[1] https://www.unicode.org/reports/tr46/
[2] https://bugs.python.org/issue28414
[3] https://github.com/python/cpython/pull/3010
