[issue10546] UTF-16-LE and UTF-16-BE support non-BMP characters

New submission from STINNER Victor <victor.stinner@haypocalc.com>:

The Python 3 documentation says that UTF-16-LE and UTF-16-BE only support BMP characters. What? I think that is wrong. It may have been true with Python 2 on a narrow build (where unichr() only supports BMP characters), but it is no longer true in Python 3.

----------
assignee: docs@python
components: Documentation
files: utf_16_bmp.patch
keywords: patch
messages: 122479
nosy: docs@python, haypo
priority: normal
severity: normal
status: open
title: UTF-16-LE and UTF-16-BE support non-BMP characters
versions: Python 3.1, Python 3.2
Added file: http://bugs.python.org/file19830/utf_16_bmp.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue10546>
_______________________________________

Terry J. Reedy <tjreedy@udel.edu> added the comment:

Marc or Alexander, can you confirm that the patch is correct?

----------
assignee: docs@python -> cgw
nosy: +belopolsky, cgw, lemburg, terry.reedy
stage: -> commit review

Changes by Terry J. Reedy <tjreedy@udel.edu>:

----------
assignee: cgw ->

Changes by Terry J. Reedy <tjreedy@udel.edu>:

----------
assignee: -> docs@python

Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

If Victor says so ... Someone needs to check that it works on a UCS4 build, but on a narrow build I don't think UTF-16-XX encodings need to do anything special - they just encode the surrogates as ordinary code units.
>>> '\U00010000'.encode('UTF-16-BE').decode('UTF-16-BE') == '\U00010000'
True
>>> '\U00010000'.encode('UTF-16-LE').decode('UTF-16-LE') == '\U00010000'
True
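
To illustrate the point about surrogates being encoded as ordinary code units, here is a minimal sketch that is not part of the original thread or the patch. It splits the encoder output into 16-bit code units so the surrogate pair becomes visible. The helper name utf16_units is hypothetical; the sketch relies only on str.encode and struct.unpack from the standard library and assumes Python 3.

```python
import struct


def utf16_units(text, byteorder="big"):
    """Return the 16-bit code units produced by UTF-16-BE or UTF-16-LE.

    Hypothetical helper for illustration only. Neither codec writes a
    BOM, so the output length is always a multiple of two bytes.
    """
    codec = "utf-16-be" if byteorder == "big" else "utf-16-le"
    data = text.encode(codec)
    # ">" = big-endian, "<" = little-endian, "H" = unsigned 16-bit integer
    fmt = (">" if byteorder == "big" else "<") + "H" * (len(data) // 2)
    return list(struct.unpack(fmt, data))


# U+10000 is the first code point outside the BMP.
units = utf16_units("\U00010000")
print([hex(u) for u in units])  # ['0xd800', '0xdc00']
```

The two values are a high surrogate followed by a low surrogate, which is how UTF-16 represents any code point above U+FFFF. Both byte orders yield the same code units; only the byte layout within each unit differs.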

Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

I have verified that the UTF-16-XX encodings work on a wide build. The doc change LGTM. Bonus points for checking that we have unit tests for these encodings that include non-BMP characters.
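
As a rough idea of what such a unit test could look like, the following is a sketch under stated assumptions, not the test that exists in (or was added to) CPython's test suite. The class name Utf16NonBmpRoundTrip and the sample list NON_BMP_SAMPLES are hypothetical; subTest requires Python 3.4 or later.

```python
import unittest

# Hypothetical sample strings: each contains at least one code point
# above U+FFFF, i.e. outside the BMP.
NON_BMP_SAMPLES = [
    "\U00010000",    # first non-BMP code point
    "\U0001F600",    # an emoji
    "\U0010FFFF",    # last valid code point
    "a\U00010400b",  # non-BMP character surrounded by ASCII
]


class Utf16NonBmpRoundTrip(unittest.TestCase):
    def test_round_trip(self):
        # Encoding and then decoding must give back the original string.
        for codec in ("utf-16-le", "utf-16-be"):
            for text in NON_BMP_SAMPLES:
                with self.subTest(codec=codec, text=ascii(text)):
                    self.assertEqual(text.encode(codec).decode(codec), text)

    def test_encoded_length(self):
        # A non-BMP character takes two 16-bit code units (4 bytes),
        # a BMP character takes one (2 bytes).
        self.assertEqual(len("\U00010000".encode("utf-16-le")), 4)
        self.assertEqual(len("a\U00010400b".encode("utf-16-be")), 8)
```

Saved as a module, this could be run with `python -m unittest`.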

Changes by Alexander Belopolsky <belopolsky@users.sourceforge.net>:

----------
components: +Unicode

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

Fixed by r87135.

----------
resolution: -> fixed
status: open -> closed
participants (3)
- Alexander Belopolsky
- STINNER Victor
- Terry J. Reedy