[ssl] The weird case of IDNA

Hi,
tl;dr This mail is about internationalized domain names and TLS/SSL. It doesn't concern you if you live in ASCII-land. A couple of other developers and I would like to change the ssl module in a backwards-incompatible way to fix IDN support for TLS/SSL.
Simply speaking, the IDNA standards (internationalized domain names for applications) describe how to encode non-ASCII domain names. The DNS system and X.509 certificates cannot handle non-ASCII host names. Any non-ASCII part of a hostname is Punycode-encoded. For example, the host name 'www.bücher.de' (books) is translated into 'www.xn--bcher-kva.de'. In IDNA terms, 'www.bücher.de' is called an IDN U-label (Unicode) and 'www.xn--bcher-kva.de' an IDN A-label (ASCII). Please refer to the TR46 document [1] for more information.
In a perfect world, it would be very simple. We'd only have one IDNA standard. However, there are multiple standards that are incompatible with each other. The German TLD .de demands IDNA 2008 with UTS #46 compatibility mapping. The hostname 'www.straße.de' maps to 'www.xn--strae-oqa.de'. However, in the older IDNA 2003 standard, 'www.straße.de' maps to 'www.strasse.de', but 'strasse.de' is a totally different domain!
CPython only has support for IDNA 2003.
It's less of an issue for the socket module. It only converts text to IDNA bytes on the way in. All functions support bytes and text. Since IDNA encoding does not change ASCII and IDNA-encoded data is ASCII, it is also no problem to pass IDNA 2008-encoded text or bytes to all socket functions.
Example:
    >>> import socket
    >>> import idna  # from PyPI
    >>> names = ['straße.de', b'strasse.de', idna.encode('straße.de'), idna.encode('straße.de').decode('ascii')]
    >>> for name in names:
    ...     print(name, socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM, 0, socket.AI_CANONNAME)[0][3:5])
    ...
    straße.de ('strasse.de', ('89.31.143.1', 0))
    b'strasse.de' ('strasse.de', ('89.31.143.1', 0))
    b'xn--strae-oqa.de' ('xn--strae-oqa.de', ('81.169.145.78', 0))
    xn--strae-oqa.de ('xn--strae-oqa.de', ('81.169.145.78', 0))
As you can see, 'straße.de' is canonicalized as 'strasse.de'. The IDNA 2008 encoded hostname maps to a different IP address.
On the other hand, the ssl module is currently completely broken. It converts hostnames from bytes to text with the 'idna' codec in some places, but not in all. The SSLSocket.server_hostname attribute and the hostname passed to the callback set with SSLContext.set_servername_callback() are decoded as U-labels. A certificate's common name and subject alternative name fields are not decoded and are therefore A-labels. They *must* stay A-labels because hostname verification is only defined in terms of A-labels. We even had a security issue once, because a partial wildcard like 'xn*.example.org' must not match IDN hosts like 'xn--bcher-kva.example.org'.
In issue [2] and PR [3], we all agreed that the only sensible fix is to make 'SSLSocket.server_hostname' an ASCII text A-label. But this is a backwards-incompatible fix. On the other hand, IDNA is totally broken without the fix. Also, in my opinion, PR [3] is not going far enough. Since we have to break backwards compatibility anyway, I'd like to modify SSLContext.set_servername_callback() at the same time.
Questions:
- Is everybody OK with breaking backwards compatibility? The risk is small. ASCII-only domains are not affected and IDNA users are broken anyway.
- Should I only fix 3.7 or should we consider a backport to 3.6, too?
Regards,
Christian
[1] https://www.unicode.org/reports/tr46/
[2] https://bugs.python.org/issue28414
[3] https://github.com/python/cpython/pull/3010
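To make the two encodings in this mail concrete, here is a minimal sketch that uses only the standard library. The helper name to_a_label_2003 is made up for illustration. The raw Punycode line happens to reproduce the IDNA 2008 A-label for this one label; it is not a general IDNA 2008 encoder (the 'idna' package from PyPI is the usual tool for that).

```python
def to_a_label_2003(hostname: str) -> str:
    """Encode a hostname with CPython's built-in codec, which implements IDNA 2003."""
    return hostname.encode("idna").decode("ascii")

# Non-ASCII labels are Punycode-encoded and prefixed with "xn--".
buecher = to_a_label_2003("www.bücher.de")

# IDNA 2003 maps "ß" to "ss" before encoding, so no "xn--" label appears.
strasse_2003 = to_a_label_2003("www.straße.de")

# Raw Punycode keeps the "ß". For this single label the result matches the
# IDNA 2008 A-label quoted in the mail; this shortcut is an illustration only.
strasse_2008_label = "xn--" + "straße".encode("punycode").decode("ascii")

# Pure-ASCII input, including an existing A-label, passes through unchanged.
passthrough = to_a_label_2003("xn--strae-oqa.de")

print(buecher)
print(strasse_2003)
print(strasse_2008_label)
print(passthrough)
```

The last line is why the socket module can be worked around: an A-label computed elsewhere survives the built-in codec untouched.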

This being a security issue, I think it's okay to break 3.6. Might even backport to 3.5 if it's easy?
On Dec 29, 2017 1:59 PM, "Christian Heimes" <christian@python.org> wrote:
- Should I only fix 3.7 or should we consider a backport to 3.6, too?
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ guido%40python.org

Guido wrote:
This being a security issue I think it's okay to break 3.6. might even backport to 3.5 if it's easy?
Is it also a security issue with 2.x? If so, should a fix to 2.7 be contemplated?
Skip

On 2017-12-30 13:19, Skip Montanaro wrote:
Guido wrote:
This being a security issue I think it's okay to break 3.6. might even backport to 3.5 if it's easy?
Is it also a security issue with 2.x? If so, should a fix to 2.7 be contemplated?
IMO the IDNA encoding problem isn't a security issue per se. The ssl module just cannot handle internationalized domain names at all. IDN domains always fail to verify. Users may merely be tempted to disable hostname verification.
On the other hand, the use of IDNA 2003 and the lack of IDNA 2008 support [1] can be considered a security problem for German, Greek, Japanese, Chinese and Korean domains [2]. I have neither the resources nor the expertise to address the encoding issue.
Christian
[1] https://bugs.python.org/issue17305
[2] https://www.unicode.org/reports/tr46/#Transition_Considerations

On Fri, 29 Dec 2017 21:54:46 +0100 Christian Heimes <christian@python.org> wrote:
On the other hand ssl module is currently completely broken. It converts hostnames from bytes to text with 'idna' codec in some places, but not in all. The SSLSocket.server_hostname attribute and callback function SSLContext.set_servername_callback() are decoded as U-label. Certificate's common name and subject alternative name fields are not decoded and therefore A-labels. The *must* stay A-labels because hostname verification is only defined in terms of A-labels. We even had a security issue once, because partial wildcard like 'xn*.example.org' must not match IDN hosts like 'xn--bcher-kva.example.org'.
In issue [2] and PR [3], we all agreed that the only sensible fix is to make 'SSLContext.server_hostname' an ASCII text A-label.
What are the changes in API terms? If I'm calling wrap_socket(), can I pass `server_hostname='straße'` and it will IDNA-encode it? Or do I have to encode it myself? If the latter, it seems like we are putting the burden of protocol compliance on users.
Regards
Antoine.

On 2017-12-30 11:28, Antoine Pitrou wrote:
On Fri, 29 Dec 2017 21:54:46 +0100 Christian Heimes <christian@python.org> wrote:
On the other hand ssl module is currently completely broken. It converts hostnames from bytes to text with 'idna' codec in some places, but not in all. The SSLSocket.server_hostname attribute and callback function SSLContext.set_servername_callback() are decoded as U-label. Certificate's common name and subject alternative name fields are not decoded and therefore A-labels. The *must* stay A-labels because hostname verification is only defined in terms of A-labels. We even had a security issue once, because partial wildcard like 'xn*.example.org' must not match IDN hosts like 'xn--bcher-kva.example.org'.
In issue [2] and PR [3], we all agreed that the only sensible fix is to make 'SSLContext.server_hostname' an ASCII text A-label.
What are the changes in API terms? If I'm calling wrap_socket(), can I pass `server_hostname='straße'` and it will IDNA-encode it? Or do I have to encode it myself? If the latter, it seems like we are putting the burden of protocol compliance on users.
Only the SSLSocket.server_hostname attribute and the hostname argument to the SNI callback will change. Both values will be A-labels instead of U-labels. You can still pass a U-label to the server_hostname argument and it will be encoded with the "idna" encoding.
    >>> sock = ctx.wrap_socket(socket.socket(), server_hostname='www.straße.de')
Currently:
    >>> sock.server_hostname
    'www.straße.de'
Changed:
    >>> sock.server_hostname
    'www.strasse.de'
Christian
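The changed behaviour can be checked without a network connection. This is a sketch that assumes a Python version in which the change described above has landed (the ssl documentation lists it under 3.7). It uses wrap_bio() with in-memory BIOs instead of wrap_socket(), so no handshake takes place.

```python
import ssl

# PROTOCOL_TLS_CLIENT enables hostname checking by default. No connection
# is made because the TLS object is backed by two in-memory BIOs.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()

# Pass a U-label, as in the wrap_socket() call above.
sslobj = ctx.wrap_bio(incoming, outgoing, server_hostname="www.straße.de")

# Assumption: Python 3.7 or later, where the attribute holds ASCII text
# produced by the built-in "idna" codec (IDNA 2003), not the U-label.
print(sslobj.server_hostname)
print(sslobj.server_hostname.isascii())
```

Because the built-in codec is still IDNA 2003, the stored name is the 'ss' form. A caller who wants the IDNA 2008 name would have to pass the A-label explicitly.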

Thanks. So the change sounds ok to me.
Regards
Antoine.
On Sat, 30 Dec 2017 14:34:04 +0100 Christian Heimes <christian@python.org> wrote:
Only SSLSocket.server_hostname attribute and the hostname argument to the SNI callback will change. Both values will be A-labels instead of U-labels.

ssl.match_hostname was added in Python 2.7.9, so it looks like Python 2 should be fixed as well.
On Sat, Dec 30, 2017 at 3:50 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
Thanks. So the change sounds ok to me.
--
Thanks,
Andrew Svetlov

On Sat, Dec 30, 2017 at 2:28 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 29 Dec 2017 21:54:46 +0100 Christian Heimes <christian@python.org> wrote:
On the other hand ssl module is currently completely broken. It converts hostnames from bytes to text with 'idna' codec in some places, but not in all. The SSLSocket.server_hostname attribute and callback function SSLContext.set_servername_callback() are decoded as U-label. Certificate's common name and subject alternative name fields are not decoded and therefore A-labels. The *must* stay A-labels because hostname verification is only defined in terms of A-labels. We even had a security issue once, because partial wildcard like 'xn*.example.org' must not match IDN hosts like 'xn--bcher-kva.example.org'.
In issue [2] and PR [3], we all agreed that the only sensible fix is to make 'SSLContext.server_hostname' an ASCII text A-label.
What are the changes in API terms? If I'm calling wrap_socket(), can I pass `server_hostname='straße'` and it will IDNA-encode it? Or do I have to encode it myself? If the latter, it seems like we are putting the burden of protocol compliance on users.
Part of what makes this confusing is that there are actually three intertwined issues here. (Also, anything that deals with Unicode *or* SSL/TLS is automatically confusing, and this is about both!)
Issue 1: Python's built-in IDNA implementation is wrong (implements IDNA 2003, not IDNA 2008).
Issue 2: The ssl module insists on using Python's built-in IDNA implementation whether you want it to or not.
Issue 3: Also, the ssl module has a separate bug that means client-side cert validation has never worked for any IDNA domain.
Issue 1 is potentially a security issue, because it means that in a small number of cases, Python will misinterpret a domain name. IDNA 2003 and IDNA 2008 are very similar, but there are 4 characters that are interpreted differently, with ß being one of them. Fixing this though is a big job, and doesn't exactly have anything to do with the ssl module -- for example, socket.getaddrinfo("straße.de", 80) and sock.connect(("straße.de", 80)) also do the wrong thing. Christian's not proposing to fix this here.
It's issues 2 and 3 that he's proposing to fix.
Issue 2 is a problem because it makes it impossible to work around issue 1, even for users who know what they're doing. In the socket module, you can avoid Python's automagical IDNA handling by doing it manually, and then calling socket.getaddrinfo("strasse.de", 80) or socket.getaddrinfo("xn--strae-oqa.de", 80), whichever you prefer. In the ssl module, this doesn't work. There are two places where ssl uses hostnames. In client mode, the user specifies the server_hostname that they want to see a certificate for, and then the module runs this through Python's IDNA machinery *even if* it's already properly encoded in ascii. And in server mode, when the user has specified an SNI callback so they can find out which certificate an incoming client connection is looking for, the module runs the incoming name through Python's IDNA machinery before handing it to user code.
In both cases, the right thing to do would be to just pass through the ascii A-label versions, so savvy users can do whatever they want with them. (This also matches the general design principle around IDNA, which assumes that the pretty unicode U-labels are used only for UI purposes, and everything internal uses A-labels.)
Issue 3 is just a silly bug that needs to be fixed, but it's tangled up here because the fix is the same as for Issue 2: the reason client-side cert validation has never worked is that we've been taking the A-label from the server's certificate and checking if it matches the U-label we expect, and of course it never does because we're comparing strings in different encodings. If we consistently converted everything to A-labels as soon as possible and kept it that way, then this bug would never have happened.
What makes it tricky is that on both the client and the server, fixing this is actually user-visible.
On the client, checking sslsock.server_hostname used to always show a U-label, but if we stop using U-labels internally then this doesn't make sense. Fortunately, since this case has never worked at all, fixing it shouldn't cause any problems.
On the server, the obvious fix would be to start passing A-label-encoded names to the servername_callback, instead of U-label-encoded names. Unfortunately, this is a bit trickier, because this *has* historically worked (AFAIK) for IDNA names, so long as they didn't use one of the four magic characters that changed meaning between IDNA 2003 and IDNA 2008. But we do still need to do something. For example, right now, it's impossible to use the ssl module to implement a web server at https://straße.de, because incoming connections will use SNI to say that they expect a cert for "xn--strae-oqa.de", and then the ssl module will freak out and throw an exception instead of invoking the servername callback.
It's ugly, but probably the simplest thing is to add a new function like set_servername_callback2 that uses the A-label, and then redefine set_servername_callback as a deprecated compatibility shim:
    def set_servername_callback(self, cb):
        def shim_cb(sslobj, servername, sslctx):
            if servername is not None:
                servername = servername.encode("ascii").decode("idna")
            return cb(sslobj, servername, sslctx)
        self.set_servername_callback2(shim_cb)
We can bikeshed what the new name should be. Maybe set_sni_callback? or set_server_hostname_callback, since the corresponding client-mode argument is server_hostname?
-n
--
Nathaniel J. Smith -- https://vorpus.org
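The server-side failure described above can be reproduced with the built-in codec alone. This is a rough sketch, on the assumption that the ssl module decodes the SNI name with the same "idna" codec as the shim. The helper name a_label_to_u_label is hypothetical.

```python
def a_label_to_u_label(servername: str) -> str:
    """Decode an ASCII A-label hostname the same way the shim above does."""
    return servername.encode("ascii").decode("idna")

# A name without any of the IDNA 2003 / IDNA 2008 deviation characters
# decodes without trouble.
buecher = a_label_to_u_label("xn--bcher-kva.de")

# The IDNA 2008 A-label for "straße.de" does not round-trip through the
# IDNA 2003 codec ("ß" would be re-encoded as "ss"), so decoding raises.
try:
    a_label_to_u_label("xn--strae-oqa.de")
except UnicodeError as exc:
    strasse_error = exc
else:
    strasse_error = None

print(buecher)
print(type(strasse_error).__name__)
```

A callback that receives the A-label directly never hits this decode step, which is the point of handing A-labels to user code.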

On Sat, 30 Dec 2017 23:27:04 -0800 Nathaniel Smith <njs@pobox.com> wrote:
We can bikeshed what the new name should be. Maybe set_sni_callback? or set_server_hostname_callback, since the corresponding client-mode argument is server_hostname?
Or set_idna_servername_callback().
Regards
Antoine.

Nathaniel Smith writes:
Issue 1: Python's built-in IDNA implementation is wrong (implements IDNA 2003, not IDNA 2008).
Is "wrong" the right word here? I'll grant you that 2008 is *better*, but typically in practice versions coexist for years. Ie, is there no backward compatibility issue with registries that specified IDNA 2003?
This is not entirely an idle question: I'd like to tool up on the RFCs, research existing practice (especially in the East/Southeast Asian registries), and contribute to the implementation if there may be an issue remaining. (Interpreting RFCs is something I'm reasonably good at.)
Steve

On Dec 31, 2017 7:37 AM, "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Nathaniel Smith writes:
Issue 1: Python's built-in IDNA implementation is wrong (implements IDNA 2003, not IDNA 2008).
Is "wrong" the right word here? I'll grant you that 2008 is *better*, but typically in practice versions coexist for years. Ie, is there no backward compatibility issue with registries that specified IDNA 2003?
Well, yeah, I was simplifying, but at the least we can say that always and only using IDNA 2003 certainly isn't right :-). I think in most cases the preferred way to deal with these kinds of issues is not to carry around an IDNA 2003 implementation, but instead to use an IDNA 2008 implementation with the "transitional compatibility" flag enabled in the UTS46 preprocessor? But this is rapidly exceeding my knowledge.
This is another reason why we ought to let users do their own IDNA handling if they want...
This is not entirely an idle question: I'd like to tool up on the RFCs, research existing practice (especially in the East/Southeast Asian registries), and contribute to the implementation if there may be an issue remaining. (Interpreting RFCs is something I'm reasonably good at.)
Maybe this is a good place to start: https://github.com/kjd/idna/blob/master/README.rst
-n
[Sorry if my quoting is messed up; posting from my phone and Gmail for Android apparently generates broken text/plain.]

On Sun, Dec 31, 2017 at 09:07:01AM -0800, Nathaniel Smith wrote:
This is another reason why we ought to let users do their own IDNA handling if they want...
I expect that letting users do their own IDNA handling will correspond to not doing any IDNA handling at all.
--
Steve

On Sun, Dec 31, 2017 at 5:39 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Dec 31, 2017 at 09:07:01AM -0800, Nathaniel Smith wrote:
This is another reason why we ought to let users do their own IDNA handling if they want...
I expect that letting users do their own IDNA handling will correspond to not doing any IDNA handling at all.
You did see the words "if they want", right? I'm not talking about removing the stdlib's default IDNA handling, I'm talking about fixing the cases where the stdlib goes out of its way to prevent users from overriding its IDNA handling.
And "users" here is a very broad category; it includes libraries like requests, twisted, trio, ... that are already doing better IDNA handling than the stdlib, except in cases where the stdlib actively prevents it.
-n
--
Nathaniel J. Smith -- https://vorpus.org

On Sun, Dec 31, 2017 at 05:51:47PM -0800, Nathaniel Smith wrote:
On Sun, Dec 31, 2017 at 5:39 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Dec 31, 2017 at 09:07:01AM -0800, Nathaniel Smith wrote:
This is another reason why we ought to let users do their own IDNA handling if they want...
I expect that letting users do their own IDNA handling will correspond to not doing any IDNA handling at all.
You did see the words "if they want", right?
Yes. It's the people who don't know that they ought to handle IDNA that concern me. They would "want to" if they knew they ought to, but they don't because they never even thought of non-ASCII URLs, and consequently they write libraries or applications open to IDNA security issues.
I'm not talking about removing the stdlib's default IDNA handling, I'm talking about fixing the cases where the stdlib goes out of its way to prevent users from overriding its IDNA handling.
That wasn't clear to me. I completely agree that the stdlib preventing people from overriding the IDNA is a bad thing that ought to be fixed, and that users should be able to opt out of it (presumably if they know enough to do that, they know enough to avoid IDNA vulnerabilities). I thought you meant it ought to be opt-in.
Sorry for misunderstanding you, but your wording suggested to me that you meant that the stdlib shouldn't do IDNA handling at all unless the user did it themselves (perhaps by calling an IDNA library in the std lib). I see now that's not what you meant.
--
Steve

On Mon, Jan 1, 2018 at 12:39 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Dec 31, 2017 at 09:07:01AM -0800, Nathaniel Smith wrote:
This is another reason why we ought to let users do their own IDNA handling if they want...
I expect that letting users do their own IDNA handling will correspond to not doing any IDNA handling at all.
That'll lead to one of two possibilities:
1) People use Unicode strings to represent domain names. Python's existing IDNA handling will happen; they're not doing their own. Not what you're talking about.
Or:
2) People use byte strings to represent domain names. Any non-ASCII characters will simply cause an exception, if I'm not mistaken. Safe, but not as functional.
ChrisA

On 31 Dec 2017, at 18:07, Nathaniel Smith <njs@pobox.com> wrote:
On Dec 31, 2017 7:37 AM, "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Nathaniel Smith writes:
Issue 1: Python's built-in IDNA implementation is wrong (implements IDNA 2003, not IDNA 2008).
Is "wrong" the right word here? I'll grant you that 2008 is *better*, but typically in practice versions coexist for years. Ie, is there no backward compatibility issue with registries that specified IDNA 2003?
Well, yeah, I was simplifying, but at the least we can say that always and only using IDNA 2003 certainly isn't right :-). I think in most cases the preferred way to deal with these kinds of issues is not to carry around an IDNA 2003 implementation, but instead to use an IDNA 2008 implementation with the "transitional compatibility" flag enabled in the UTS46 preprocessor? But this is rapidly exceeding my knowledge.
This is another reason why we ought to let users do their own IDNA handling if they want…
Do you know what the major browsers do w.r.t. IDNA support? If those unconditionally use IDNA 2008, it should be fairly safe to move to that in Python as well, because that would mean we’re less likely to run into backward compatibility issues.
Ronald

Christian Heimes writes:
tl;dr This mail is about internationalized domain names and TLS/SSL. It doesn't concern you if you live in ASCII-land. Me and a couple of other developers like to change the ssl module in a backwards-incompatible way to fix IDN support for TLS/SSL.
Yes please! Seriously, we *need* to fix the bug for German, and I would presume other languages that have used pure-ASCII transcodings, which I bet are in very common use in domain names. Do you have an issue # for this offhand? If not I'll just go dig it out for myself.
In a perfect world, it would be very simple. We'd only had one IDNA standard. However there are multiple standards that are incompatible with each other.
You forgot the obligatory XKCD: https://www.xkcd.com/927. ;-)
The German TLD .de demands IDNA-2008 with UTS#46 compatibility mapping. The hostname 'www.straße.de' maps to 'www.xn--strae-oqa.de'. However in the older IDNA 2003 standard, 'www.straße.de' maps to 'www.strasse.de', but 'strasse.de' is a totally different domain!
That's a mess! I bet the domain squatters have had a field day.
Questions: - Is everybody OK with breaking backwards compatibility? The risk is small. ASCII-only domains are not affected
That's not quite true, as your German example shows. In some Oriental renderings it is impossible to distinguish halfwidth digits from full-width ones as the same glyphs are used. (This occasionally happens with other ASCII characters, but users are more fussy about digits lining up.) That is, while technically ASCII-only domain names are not affected, users of ASCII-only domain names are potentially vulnerable to confusable names when IDNA is introduced. (Hopefully the Asian registrars are as woke as the German ones! But you could still register a .com containing full-width digits or letters.)
and IDNA users are broken anyway.
Agree with your analysis, except for the fine point above. Japanese don't use IDNA much yet (except like the WIDE folks, who know what they're doing), so I have little experience with potential breakage. On the other hand that suggests that transitioning quickly will be helpful.
- Should I only fix 3.7 or should we consider a backport to 3.6, too?
3.7 has a *lot* of new stuff in it. I suspect a lot of people are going to take their time moving production sites to it, so +1 on a backport. 3.5 too, if it's not too hard.

On Sat, Dec 30, 2017 at 7:26 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Christian Heimes writes:
Questions: - Is everybody OK with breaking backwards compatibility? The risk is small. ASCII-only domains are not affected
That's not quite true, as your German example shows. In some Oriental renderings it is impossible to distinguish halfwidth digits from full-width ones as the same glyphs are used. (This occasionally happens with other ASCII characters, but users are more fussy about digits lining up.) That is, while technically ASCII-only domain names are not affected, users of ASCII-only domain names are potentially vulnerable to confusable names when IDNA is introduced. (Hopefully the Asian registrars are as woke as the German ones! But you could still register a .com containing full-width digits or letters.)
This particular example isn't an issue: in IDNA encoding, full-width and half-width digits are normalized together, so number１.com (with a full-width digit one) and number1.com actually refer to the same domain name. This is true in both the 2003 and 2008 versions:
    # IDNA 2003
    In [7]: "number\uff11.com".encode("idna")
    Out[7]: b'number1.com'
    # IDNA 2008 (using the 'idna' package from pypi)
    In [8]: idna.encode("number\uff11.com", uts46=True)
    Out[8]: b'number1.com'
That said, IDNA does still allow for a bunch of spoofing opportunities that aren't possible with pure ASCII, and this requires some care: https://unicode.org/faq/idn.html#16
This is mostly a UI issue, though; there's not much that the socket or ssl modules can do to help here.
-n
--
Nathaniel J. Smith -- https://vorpus.org
participants (10)
- Andrew Svetlov
- Antoine Pitrou
- Chris Angelico
- Christian Heimes
- Guido van Rossum
- Nathaniel Smith
- Ronald Oussoren
- Skip Montanaro
- Stephen J. Turnbull
- Steven D'Aprano