[New-bugs-announce] [issue18282] Ugly behavior of binary and unicode handling on reading unknown encoded files

Sat Jun 22 16:54:24 CEST 2013

New submission from Sworddragon:

Python 3 currently has some problems handling files whose encoding is unknown. In this example we have a file encoded as ISO-8859-1 with the content "ä" that we want to read. Let's see what Python 3 can currently do here:

1. We can simply open the file and try to read the content. In my case the encoding is automatically set to UTF-8, but the read() operation throws an exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data
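
A minimal sketch that reproduces this case. The temporary file and the variable names are only illustrative, and the encoding is passed explicitly because the default depends on the locale:

```python
import os
import tempfile

# Create a file holding the single byte 0xE4, which is "ä" in ISO-8859-1.
path = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(path, "wb") as f:
    f.write(b"\xe4")

# Reading it as UTF-8 with the default errors="strict" fails, because
# 0xE4 announces a multi-byte sequence that is never completed.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

print(repr(strict_error))
```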

2. Now let's look a little closer at the arguments of open(). There is an errors argument which might be useful:
2.1. "strict" is the default behavior which was already tested.
2.2. "ignore" will not throw any exception but silently drops any bytes which can't be decoded. This would be problematic in many cases.
2.3. "replace" substitutes a replacement character for anything which can't be decoded, which will be problematic in many cases too.
2.4. "surrogateescape" can throw exceptions too, once the decoded text is encoded again (for example when it is written out): UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 0: surrogates not allowed
2.5. "xmlcharrefreplace" and "backslashreplace" are not used for reading.
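
The handlers from 2.2 to 2.4 can be compared on the same byte with bytes.decode(), which accepts the same errors values as open(). This is only a sketch; the variable names are illustrative:

```python
raw = b"\xe4"  # "ä" encoded as ISO-8859-1

# "ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD, so the original byte is lost.
replaced = raw.decode("utf-8", errors="replace")

# "surrogateescape" maps the byte to the lone surrogate U+DCE4.
escaped = raw.decode("utf-8", errors="surrogateescape")

# The original byte comes back only if the same handler is used for encoding.
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# A strict encode of the escaped text raises the error quoted in 2.4.
try:
    escaped.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(repr(ignored), repr(replaced), repr(escaped), round_trip, repr(encode_error))
```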

3. Since trying to decode the file causes so many problems, we can read the file as binary content instead. This works in all cases but causes another problem: any unicode string that must be concatenated with the content of the file must be converted to a binary string too (like b'some_unicode_content' or some_unicode_variable.encode()). The same applies to unicode strings that are concatenated elsewhere with the newly converted unicode_to_binary variable, even if they don't touch the file content. This hurts maintainability.
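
The conversion burden described in point 3 looks roughly like this. The names are illustrative, and the file content is stood in for by a bytes literal:

```python
file_content = b"\xe4"   # what f.read() returns for a file opened with "rb"
label = "content: "      # an ordinary unicode string

# Mixing str and bytes directly is a TypeError in Python 3.
try:
    label + file_content
    mix_error = None
except TypeError as exc:
    mix_error = exc

# Every unicode string that meets the file content has to be encoded first.
combined = label.encode("utf-8") + file_content

print(repr(mix_error), combined)
```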

As you can see, all current solutions in Python 3 have big disadvantages. If I'm overlooking something, feel free to correct me. For now I have developed my own solution in Python which solves the problem: a function that autodetects the encoding of the file. Maybe there could also be a native way to do this in open(), or maybe another way could be found to solve this problem.
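
The function itself is not included in this report. Purely as an illustration of the idea, one simple approach is to try a list of candidate encodings in order and take the first that decodes without error. The name read_text_guessing and the candidate list are hypothetical, and this kind of guessing can pick a wrong encoding whenever several candidates accept the same bytes:

```python
import os
import tempfile


def read_text_guessing(path, candidates=("utf-8", "iso-8859-1")):
    """Hypothetical helper: return (text, encoding) for the first candidate that decodes."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the file")


# Usage with the file from the example: a single 0xE4 byte.
path = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(path, "wb") as f:
    f.write(b"\xe4")

text, encoding = read_text_guessing(path)
print(repr(text), encoding)
```

Because ISO-8859-1 maps every byte to a character, putting it last makes it a catch-all fallback, so the order of the candidates matters.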

----------
components: IO
messages: 191643
nosy: Sworddragon
priority: normal
severity: normal
status: open
title: Ugly behavior of binary and unicode handling on reading unknown encoded files
type: enhancement
versions: Python 3.3

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18282>
_______________________________________