[New-bugs-announce] [issue15866] encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs
report at bugs.python.org
Wed Sep 5 20:54:34 CEST 2012
New submission from Wim:
Encoding a (well-formed) Unicode string containing a non-BMP character, using the xmlcharrefreplace error handler, will produce two XML entities for surrogate codepoints instead of one entity for the actual character.
Here's a transcript (Python 2.7.3, x86_64):
>>> b = '\xf0\x9f\x92\x9d'
>>> u = b.decode('utf8')
>>> u.encode('ascii', errors='xmlcharrefreplace')
>>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace')
>>> u.encode('utf8', errors='xmlcharrefreplace')
The utf8 bytestring is correctly decoded, and the print representation shows one single Unicode character. Encoding using xmlcharrefreplace produces two XML entities, which is wrong: a single non-BMP character should be represented in XML as a single entity reference, in this case presumably '💝'.
As the last two lines show, I'm using a narrow build (so the unicode strings are represented internally in UTF-16, I guess). Converting the string back to utf8 does the right thing, and emits a single UTF8 sequence representing the supplementary-plane codepoint.
(FWIW, the backslashreplace error handler also emits a surrogate pair, but I don't know if there is a complete specification for what that handler does, so it's possible that it's not wrong.)
components: Library (Lib), Unicode
nosy: ezio.melotti, wiml
title: encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs
versions: Python 2.7
Python tracker <report at bugs.python.org>
More information about the New-bugs-announce