[issue10370] py3 readlines() reports wrong offset for UnicodeDecodeError

STINNER Victor report at bugs.python.org
Tue Nov 9 02:44:07 CET 2010


STINNER Victor <victor.stinner at haypocalc.com> added the comment:

The error occurs in .readline(), which fills a buffer by reading the file chunk by chunk. Each chunk is decoded by the stateful decoder as soon as it is read. The problem is that the decoder doesn't know the file offset. Even if it did, the start and end attributes of UnicodeDecodeError are indexes into the bytes object passed to the decoder, not offsets in the file.
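
To illustrate the point about chunk-relative indexes, here is a minimal sketch using an incremental decoder from the codecs module directly. It is not the TextIOWrapper code path itself; the two hand-made chunks and the variable names are assumptions chosen for the example.

```python
import codecs

# Two hand-made chunks standing in for successive buffer reads.
# The invalid byte 0xff sits at absolute offset 6 in the combined data.
chunks = [b"abcd", b"ef\xffgh"]

decoder = codecs.getincrementaldecoder("utf-8")()

consumed = 0            # bytes handed to the decoder in earlier chunks
relative_start = None   # what UnicodeDecodeError.start reports
absolute_offset = None  # what the caller would actually want

for chunk in chunks:
    try:
        decoder.decode(chunk)
    except UnicodeDecodeError as exc:
        # exc.start indexes into exc.object (the bytes being decoded),
        # not into the whole stream.
        relative_start = exc.start
        # Valid here only because no partial character was pending
        # from the previous chunk.
        absolute_offset = consumed + exc.start
        break
    consumed += len(chunk)

print(relative_start, absolute_offset)
```

The exception only knows about the second chunk, so start is 2 even though the bad byte is at offset 6 of the whole data.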

> but reports an error at offset 4096 (reported as "0")

4096 is the buffer_size attribute of BufferedReader: .readline() -> ._read_chunk() -> .buffer.read1().

> The misreported offset does not occur with read(), just with readlines().

.read() is special: it reads the whole file at once and decodes the binary content in a single call, so the error index is relative to the whole content.

--

I don't consider this a bug, so I'm closing the issue as invalid.

--

Using .readline() to locate an invalid byte is not the right algorithm. If you would like to do that, you should open the file in binary mode and decode the content yourself, chunk by chunk. Or, if you only work with small files, you can use .read() as you wrote.
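
The chunk-by-chunk approach could look roughly like the sketch below. The function name find_invalid_byte_offset and its parameters are hypothetical, and it assumes a codec whose incremental decoder exposes its pending bytes through getstate(), as the UTF-8 decoder does.

```python
import codecs


def find_invalid_byte_offset(path, encoding="utf-8", chunk_size=4096):
    """Return the absolute file offset of the first undecodable byte,
    or None if the whole file decodes cleanly.

    Hypothetical helper; assumes decoder.getstate()[0] holds the bytes
    of an incomplete character carried over from the previous chunk.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    chunk_start = 0  # file offset of the first byte of the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # Bytes buffered from the previous chunk are prepended to the
            # data being decoded, so exc.start is relative to them too.
            pending = len(decoder.getstate()[0])
            try:
                # An empty chunk means end of file: flush with final=True
                # so a truncated trailing character is reported as well.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as exc:
                return chunk_start - pending + exc.start
            if not chunk:
                return None
            chunk_start += len(chunk)
```

Usage would be something like find_invalid_byte_offset("data.txt"), where the file name is a placeholder. Subtracting the pending byte count matters when a multi-byte character straddles a chunk boundary.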

----------
nosy: +haypo
resolution:  -> invalid
status: open -> closed

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10370>
_______________________________________

