Nick Coghlan writes:

 > On Fri, May 27, 2011 at 4:14 PM, INADA Naoki <songofacandy@gmail.com> wrote:
 > > I love unicode and use unicode when I can use it. But this is a problem in the real world. For example, Python 2 is convenient for analyzing line based logs containing some different encodings.
 >
 > Where's the use case for bytes here?
 > Python 3
 > ...deliberately makes that difficult because it is *wrong*.
Nick, you should have stopped there. :-)

I can see very little difference between Python 2 and Python 3 in this use case, except that Python 2 makes it much easier to write easily crashable programs.

In both versions, the safe thing to do for such a program is to slurp the whole log with open(log, encoding=<whatever>, errors=<something nonfatal>) (that's Python 3 code; Python 2 in fact makes this more tedious). But no need for reading as bytes is visible here in Python 3; move along, people!

Alternatively, one could write a function that reads lines from the log as bytes, tries different encodings for each line (perhaps interacting with the user), and eventually falls back to some default encoding and a nonfatal error handler to get *something*. This requires reading as bytes, but AFAICS it's no easier to write in Python 2. Granted, such a function will not easily be portable between Python 2 and 3, but that's a different problem.
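For concreteness, here's a minimal Python 3 sketch of both approaches; the file name, the candidate encodings, and the process() helper are hypothetical stand-ins, not anything from Inada-san's analyzer:

    # Minimal Python 3 sketch of both approaches; file name, candidate
    # encodings, and process() are hypothetical stand-ins.

    def process(line):
        print(line.rstrip())        # stand-in for the real per-line analysis

    # Approach 1: slurp the log as text with a nonfatal error handler.
    with open("app.log", encoding="utf-8", errors="surrogateescape") as f:
        for line in f:
            process(line)

    # Approach 2: read as bytes and try candidate encodings line by line.
    def decode_line(raw, encodings=("utf-8", "shift_jis", "euc-jp")):
        for enc in encodings:
            try:
                return raw.decode(enc)
            except UnicodeDecodeError:
                pass
        # Eventually fall back to a default encoding and a nonfatal handler.
        return raw.decode("utf-8", errors="replace")

    with open("app.log", "rb") as f:
        for raw in f:
            process(decode_line(raw))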
 > Binary files containing a mixture of encodings cannot be safely treated as text.
"Safety" is use-case-dependent. I suppose Inada-san considers using Python 2 strs to receive file input safe enough for his log analyzer. While we shouldn't encourage that (and either errors='ignore' or errors='surrogateescape' should be easy enough for him in the log analysis case[1]), I don't think we should demand GIGO with 100% fidelity in all use cases, either. Footnotes: [1] In new code. Again, a port of existing Python 2 code to Python 3 might not be trivial, depending on how he handles unexpected encodings and how pervasively they are manipulated in his program.