[issue9377] socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names

David Watson report at bugs.python.org
Fri Jul 30 20:11:46 CEST 2010


David Watson <baikie at users.sourceforge.net> added the comment:

OK, here are new versions of the original patches.

I've tweaked the docs to make clear that ASCII-compatible
encodings actually *are* ASCII, and point to an explanation as
soon as they're mentioned.

You're right that PyUnicode_AsEncodedString() is the preferable
interface for the argument converter (I think I got
PyUnicode_AsEncodedObject() from an old version of
PyUnicode_FSConverter() :/).  For the ASCII step, though, I've
just short-circuited it and used PyUnicode_EncodeASCII()
directly, since the converter has already checked that the
object is of Unicode type.  For the IDNA step,
PyUnicode_AsEncodedString() should result in a less confusing
error message if the codec returns some non-bytes object one
day.

However, the PyBytes_Check isn't there to check up on the codec,
but to check for a bytes argument, which the converter also
supports.  For that reason, I think encode_hostname would be a
misleading name; instead I've renamed it hostname_converter,
after the example of PyUnicode_FSConverter, and renamed
unicode_from_hostname to decode_hostname.
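
For readers without the patch to hand, the behaviour described above
can be approximated in pure Python.  This is only a sketch, not the
patch itself: the real converter is written in C, and the use of the
"surrogateescape" error handler in the ASCII step is an assumption
taken from the patch's file name and the transcripts further down.
The function name mirrors the one in the patch, but the body is
hypothetical.

```python
def hostname_converter(obj):
    """Hypothetical Python analogue of the C argument converter."""
    # A bytes argument is accepted unchanged (the PyBytes_Check case).
    if isinstance(obj, bytes):
        return obj
    if not isinstance(obj, str):
        raise TypeError("hostname must be str or bytes, not %s"
                        % type(obj).__name__)
    try:
        # ASCII step (assumed to use surrogateescape): plain ASCII passes
        # through, and escaped bytes such as '\udce2' become b'\xe2' again.
        return obj.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # IDNA step: only reached for genuine non-ASCII characters.
        return obj.encode("idna")
```

Under these assumptions an ordinary ASCII name and a bytes object come
back unchanged, a surrogateescape string comes back as its original
bytes, and a real non-ASCII name goes through the "idna" codec.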

I've also made the converter check for UnicodeEncodeError in the
ASCII step, but the end result really is UnicodeError if the IDNA
step fails, because the "idna" codec does not use
UnicodeEncodeError or UnicodeDecodeError.  Complain about that if
you wish :)
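
The difference between the two error types can be seen from the
interpreter.  The helper below is hypothetical and exists only for
illustration; it catches UnicodeError, which also covers
UnicodeEncodeError because the latter is a subclass.  The exact class
raised by the "idna" codec may differ between Python versions, so the
sketch prints it rather than assuming it.

```python
def encode_error(text, codec):
    """Return the exception raised by text.encode(codec), or None on success."""
    try:
        text.encode(codec)
    except UnicodeError as exc:  # also catches UnicodeEncodeError
        return exc
    return None

# ASCII step failing on a real non-ASCII character (U+20AC, the euro sign).
ascii_err = encode_error("\u20ac", "ascii")
# IDNA step failing on a surrogateescape string.
idna_err = encode_error("\udce2\udc82\udcac", "idna")

print(type(ascii_err).__name__)
print(type(idna_err).__name__)
```

Code that calls the converter and wants to handle both failures can
therefore catch UnicodeError alone.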


I think the example I gave in the previous comment was also
confusing, so just to be clear...

In /etc/hosts (in UTF-8 encoding):

127.0.0.2       €
127.0.0.3       xn--lzg


Without patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
>>> '€'.encode("idna")
b'xn--lzg'


With patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('\udce2\udc82\udcac', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
>>> '\udce2\udc82\udcac'.encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
    result.extend(ToASCII(label))
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
    raise UnicodeError("Invalid character %r" % c)
UnicodeError: Invalid character '\udce2'


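The exception at the end demonstrates why surrogateescape strings
don't get confused with IDNs: the lone surrogates they contain are
rejected by nameprep, so such a string can never be IDNA-encoded.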
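
The decoding side of the same round trip can be sketched in Python as
well.  The function name mirrors decode_hostname from the patch, but
the body is an assumption: it simply decodes as ASCII with the
"surrogateescape" error handler, which is what the patched
getnameinfo() output above suggests.

```python
def decode_hostname(raw):
    """Hypothetical Python analogue of the patch's decode_hostname."""
    # Non-ASCII bytes are mapped to lone surrogates U+DC80..U+DCFF.
    return raw.decode("ascii", "surrogateescape")

raw = "\u20ac".encode("utf-8")      # the euro sign as stored in /etc/hosts
escaped = decode_hostname(raw)

# The escaped form matches the patched getnameinfo() result...
assert escaped == "\udce2\udc82\udcac"
# ...it converts back to exactly the original bytes...
assert escaped.encode("ascii", "surrogateescape") == raw
# ...and it is a different string from the real euro sign, which is
# the one that would be IDNA-encoded to b'xn--lzg'.
assert escaped != "\u20ac"
```

Because the escaped string and the real character are distinct, the
name returned for 127.0.0.2 can be passed back to getaddrinfo()
without being mistaken for the IDN registered at 127.0.0.3.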

----------
Added file: http://bugs.python.org/file18272/ascii-surrogateescape-2.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9377>
_______________________________________

