[New-bugs-announce] [issue11461] Reading UTF-16 with codecs.readline() breaks on surrogate pairs

Thu Mar 10 11:19:58 CET 2011

New submission from Yuriy Pilgun <ply at ukrpost.net>:

Reading a UTF-16 text file with the 'codecs' module fails if a surrogate pair is located at the 72-character boundary.

The attached Python script fails with the message:
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 70-71: unexpected end of data

The cause is that readline() splits the input data into chunks, so a chunk can end in the middle of a surrogate pair. The chunk size is set by:
  readsize = size or 72
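
The effect of that chunk size can be illustrated without the codecs stream machinery. The following is only a minimal sketch, not the attached testutf16.py: the sample text is hypothetical, and it assumes the chunk is simply the first 72 bytes of the encoded data (a 2-byte BOM plus 34 BMP characters places the surrogate pair at byte offset 70).

```python
# Hypothetical sample text (not taken from the attached script):
# 34 BMP characters followed by one character outside the BMP,
# which UTF-16 encodes as a surrogate pair (4 bytes).
text = u"a" * 34 + u"\U0001D11E" + u"\n"

# Encoding with "utf-16" prepends a 2-byte BOM, so the surrogate pair
# occupies bytes 70-73 of the encoded data.
data = text.encode("utf-16")

# Assumption: a chunked reader hands the decoder the first 72 bytes,
# which end after the high surrogate and before the low surrogate.
chunk = data[:72]

try:
    chunk.decode("utf-16")
    error = None
except UnicodeDecodeError as exc:
    # Decoding the truncated chunk on its own fails at the split pair.
    error = exc

print(error)
```

Decoding the complete data succeeds; only the chunk cut between the two halves of the pair fails, which matches the byte positions in the error message above.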

----------
components: Library (Lib), Unicode
files: testutf16.py
messages: 130498
nosy: ply
priority: normal
severity: normal
status: open
title: Reading UTF-16 with codecs.readline() breaks on surrogate pairs
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file21070/testutf16.py

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue11461>
_______________________________________