[New-bugs-announce] [issue11461] Reading UTF-16 with codecs.readline() breaks on surrogate pairs
Yuriy Pilgun
report at bugs.python.org
Thu Mar 10 11:19:58 CET 2011
New submission from Yuriy Pilgun <ply at ukrpost.net>:
Reading a UTF-16 text file with the 'codecs' module fails if a surrogate pair straddles the 72-byte chunk boundary.
The attached Python script fails with the message:
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 70-71: unexpected end of data
The cause is that readline() splits the input data into chunks, namely:
readsize = size or 72
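
The chunking problem can be illustrated in isolation. This is a minimal sketch, not the attached testutf16.py. It assumes a current Python 3 interpreter, a hypothetical sample line, and the BOM-less 'utf-16-le' codec so that byte offsets are easy to count. It cuts the encoded data at 72 bytes, shows that decoding the first chunk as if it were complete fails, and shows that an incremental decoder, which buffers the dangling high surrogate, recovers the full text.

```python
import codecs

# Hypothetical sample line: 35 BMP characters, one character outside the BMP
# (encoded in UTF-16 as a surrogate pair), then a newline.
text = "a" * 35 + "\U00010400" + "\n"
data = text.encode("utf-16-le")   # 70 bytes + 4-byte surrogate pair + 2 bytes

CHUNK = 72                        # the default chunk size quoted in the report
first, rest = data[:CHUNK], data[CHUNK:]

# Decoding the first chunk on its own fails: bytes 70-71 hold only the
# high surrogate, and the low surrogate is in the next chunk.
try:
    first.decode("utf-16-le")
    stateless_error = None
except UnicodeDecodeError as exc:
    stateless_error = exc

# An incremental decoder keeps the incomplete pair buffered until the
# remaining bytes arrive, so the split is harmless.
decoder = codecs.getincrementaldecoder("utf-16-le")()
decoded = decoder.decode(first) + decoder.decode(rest, final=True)

print(type(stateless_error).__name__)
print(decoded == text)
```

The sample places the surrogate pair at bytes 70-73 so that the 72-byte cut falls in the middle of it, matching the byte positions in the error message above.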
----------
components: Library (Lib), Unicode
files: testutf16.py
messages: 130498
nosy: ply
priority: normal
severity: normal
status: open
title: Reading UTF-16 with codecs.readline() breaks on surrogate pairs
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file21070/testutf16.py
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue11461>
_______________________________________