
On Fri, May 27, 2011 at 4:14 PM, INADA Naoki <songofacandy@gmail.com> wrote:
> I love unicode and use unicode when I can use it. But this is a problem in the real world. For example, Python 2 is convenient for analyzing line based logs containing some different encodings. Python 3
...deliberately makes that difficult because it is *wrong*. Binary files containing a mixture of encodings cannot be safely treated as text.

The closest you can get is to support only ASCII-compatible encodings: decode the data as ASCII with the "surrogateescape" error handler, so that bytes with the high-order bit set are faithfully reproduced on re-encoding. However, such code will potentially fail once it encounters an encoding that is not ASCII-compatible, such as UTF-16 or UTF-32.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
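
A minimal sketch of the "surrogateescape" approach described above, for Python 3. The sample log lines and the helper name `roundtrip` are made up for illustration and are not from the original message. It shows bytes in two different ASCII-compatible encodings surviving a decode/encode round trip, and then how UTF-16 data passes through without an error yet no longer behaves like the text it represents.

```python
# Hypothetical sample data: the same log line in two ASCII-compatible encodings.
raw_lines = [
    "café ok\n".encode("utf-8"),
    "café ok\n".encode("latin-1"),
]


def roundtrip(line: bytes) -> bytes:
    """Decode as ASCII, keeping non-ASCII bytes as lone surrogates, then re-encode."""
    text = line.decode("ascii", errors="surrogateescape")
    # The ASCII portion can be inspected as ordinary text.
    assert text.endswith(" ok\n")
    # Re-encoding with the same handler turns the surrogates back into the original bytes.
    return text.encode("ascii", errors="surrogateescape")


restored = [roundtrip(line) for line in raw_lines]

# UTF-16 is not ASCII-compatible: every ASCII character is followed by a NUL byte.
utf16 = "ok\n".encode("utf-16-le")
mangled = utf16.decode("ascii", errors="surrogateescape")

print(restored == raw_lines)  # the original bytes are reproduced exactly
print("ok" in mangled)        # the decoded text no longer contains "ok"
```

Note that the UTF-16 case raises no exception here. The bytes still round-trip, but any text processing on the decoded string (searching, splitting on delimiters) silently gives wrong answers, which is one way such code can fail.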