[New-bugs-announce] [issue9593] utf8 codec readlines error after "\x85 "

Fri Aug 13 21:27:03 CEST 2010

New submission from Joseph Copenhaver <joseph.copenhaver at gmail.com>:

The IO readlines() facility incorrectly processes UTF-8 files, for a reason I have not been able to identify. Specifically, the call returns too many entries in the resulting list of lines: a character sequence such as "\x85 blah" gets cut into ("\x85 ", "blah"), according to the resulting array. My workaround for this issue is not elegant, especially since I need the newline characters:

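#BEGIN: WTF
# Read the whole file and split on "\n" only, instead of calling readlines().
a_str_whole = fs_in.read()
fs_in.close()
a_str_lines = a_str_whole.split("\n")
# split() drops the separators, so put "\n" back on every line but the last.
for idx in range(0, len(a_str_lines) - 1):
    a_str_lines[idx] += "\n"
#END: WTF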
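
The same workaround can be written more compactly. This is only a sketch: the helper name split_keep_newlines and the sample string are hypothetical and do not come from the attached script. It splits on "\n" alone and keeps each terminator, so "\x85" never ends a line.

```python
import re

# Hypothetical helper. A line is either a run of non-"\n" characters followed
# by "\n", or a final non-empty run that has no terminator.
_LINE_RE = re.compile(u"[^\n]*\n|[^\n]+")


def split_keep_newlines(text):
    # Return the lines of text, each still carrying its trailing "\n".
    return _LINE_RE.findall(text)


# Sample input made up for illustration.
lines = split_keep_newlines(u"first\x85 blah\nsecond\n")
print(len(lines))
```

Unlike the loop above, this does not leave an empty last entry when the text ends with "\n".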

Attached is an example script that defines the problem clearly.
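
For readers without the attachment, here is a minimal sketch of what the symptom may look like. It is not the attached script, and it assumes the file was read through a codecs stream reader, which the report does not state. The sample string is made up. One plausible explanation, unconfirmed here, is that the unicode splitlines() method treats U+0085 (NEL) as a line boundary, while the io text layer splits on "\n" only.

```python
import codecs
import io

# Hypothetical sample text containing U+0085, not taken from the attachment.
sample = u"caf\x85 blah\n"

# Unicode splitlines() counts U+0085 as a line boundary.
via_splitlines = sample.splitlines(True)

# A codecs stream reader over the UTF-8 encoded bytes of the same text.
reader = codecs.getreader("utf-8")(io.BytesIO(sample.encode("utf-8")))
via_codecs = reader.readlines()

# The io text layer, with its default newline handling, splits on "\n" only.
via_io = io.StringIO(sample).readlines()

print(len(via_splitlines), len(via_codecs), len(via_io))
```

If the assumption holds, the first two counts are larger than the third, which matches the extra entries described above.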

----------
components: IO, Interpreter Core, Regular Expressions, Unicode
files: ErrorProof-utf8-x85.py
messages: 113818
nosy: jcope
priority: normal
severity: normal
status: open
title: utf8 codec readlines error after "\x85 "
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file18508/ErrorProof-utf8-x85.py

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9593>
_______________________________________