[New-bugs-announce] [issue3672] Ill-formed surrogates not treated as errors during encoding/decoding

Sun Aug 24 23:56:51 CEST 2008

New submission from Adam Olsen <rhamph at gmail.com>:

The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or
UTF-32 should be treated as errors.  Lone surrogates in UTF-16 should
probably be treated as errors too, but only during encoding/decoding:
unicode objects on UTF-16 builds must still be allowed to contain them,
since slicing can create them.

http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40

Lone surrogate in UTF-8:
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'

Surrogate pair in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'
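
For reference, surrogates are easy to spot at the byte level: in UTF-8
the byte 0xED is always a lead byte, and a following byte in 0xA0..0xBF
can only encode U+D800..U+DFFF.  A rough sketch of such a scan follows
(Python 3 bytes assumed; surrogate_offsets_in_utf8 is a made-up name,
and this is not how the real decoder is structured):

```python
def surrogate_offsets_in_utf8(data):
    """Return offsets where an encoded surrogate (ED A0..BF xx) starts."""
    return [
        i
        for i in range(len(data) - 1)
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF
    ]


lone = b"\xED\xA0\x81"               # the lone surrogate from above
pair = b"\xED\xA0\x81\xED\xB0\x80"   # the surrogate pair from above
print(surrogate_offsets_in_utf8(lone))   # [0]
print(surrogate_offsets_in_utf8(pair))   # [0, 3]
```

U+D7FF, the last code point before the surrogate range, encodes as
ED 9F BF and is not flagged.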

On a UTF-32 build, encoding a surrogate pair as UTF-16 and then decoding
it again produces the proper non-surrogate scalar value, so the round
trip silently changes the string.  This has security implications,
though they will rarely be hit because characters outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'
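
The value above follows from the standard UTF-16 pairing arithmetic.
A small sketch, with combine_surrogates as an illustrative name only:

```python
def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into one scalar value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD801, 0xDC00)))   # 0x10400
```

This is why a check done before the round trip can be bypassed: it sees
the two code points U+D801 and U+DC00, never U+10400.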

Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
(correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'
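
A sketch of the pairing check that would reject the output above,
working on 16-bit code units.  Python 3 is assumed, the helper names
are hypothetical, and utf16le_units expects little-endian data with
the BOM already stripped:

```python
import struct


def utf16le_units(data):
    """Split little-endian UTF-16 bytes (no BOM) into 16-bit code units."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def lone_surrogate_positions(units):
    """Return indices of surrogates that are not part of a high/low pair."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad


print(lone_surrogate_positions(utf16le_units(b"\x01\xd8")))   # [0]
```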

I have received a report from a user who decoded bad data using
x.decode('utf-8', 'replace') and then got an error from Gtk+ when the
ill-formed surrogates reached it.
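
Until the decoder rejects them, a caller in that position could scrub
the decoded text before handing it on.  A minimal sketch, assuming
Python 3.3+ strings; scrub_surrogates is a made-up helper, and using
U+FFFD mirrors what 'replace' does for other bad input:

```python
REPLACEMENT = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER


def scrub_surrogates(text):
    """Replace every surrogate code point in text with U+FFFD."""
    return "".join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )
```

The result can always be encoded as strict UTF-8, so a consumer such
as Gtk+ never sees the ill-formed sequence.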

Fixing this would cause issue 3297 to blow up loudly, rather than silently.

----------
messages: 71889
nosy: Rhamphoryncus
severity: normal
status: open
title: Ill-formed surrogates not treated as errors during encoding/decoding

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3672>
_______________________________________