On Tue, Feb 14, 2012 at 9:39 PM, Carl M. Johnson <cmjohnson.mailinglist@gmail.com> wrote:
On Feb 13, 2012, at 10:45 PM, Nick Coghlan wrote:
(Of course, if sys.stdout.encoding is "UTF-8", then you're right, those characters will just be displayed as gibberish, as they would in the latin-1 case. I guess it's only on Windows and in any other locations with a more restrictive default stdout encoding that errors are particularly likely).
I don't think that's right. I think that by default Python refuses to turn surrogate characters into UTF-8:
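
The behaviour in question can be reproduced with a short sketch (written for illustration, not the session originally posted). The sample string is an assumed value containing one lone surrogate of the kind "surrogateescape" produces:

```python
# Assumed sample: "caf" followed by the lone surrogate that
# "surrogateescape" uses to stand in for the undecodable byte 0xE9.
text = "caf\udce9"

try:
    text.encode("utf-8")  # default error handler is "strict"
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    reason = exc.reason
    print("strict UTF-8 encoding refused:", reason)

# Only an explicit "surrogateescape" handler turns the surrogate
# back into the original byte.
restored = text.encode("utf-8", "surrogateescape")
print(restored)
```

The strict encode raises rather than emitting bytes, so the escaped byte cannot leak into UTF-8 output unnoticed.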
Oops, that's what I get for posting without testing :) Still, your example clearly illustrates the point I was trying to make - that using "ascii+surrogateescape" is less likely to silently corrupt the data stream than using "latin-1", because attempts to encode it under the "strict" error handler will generally fail, even for an otherwise universal encoding like UTF-8.
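
To make the contrast concrete, here is a minimal sketch comparing the two decoding strategies on the same input. The sample bytes (UTF-8 for "café") and the variable names are assumptions chosen for illustration:

```python
# Assumed sample input: the UTF-8 bytes for "café".
raw = b"caf\xc3\xa9"

as_latin1 = raw.decode("latin-1")
as_escaped = raw.decode("ascii", "surrogateescape")

# latin-1: writing the text back out as UTF-8 succeeds, but the
# bytes no longer match the input (silent corruption).
latin1_out = as_latin1.encode("utf-8")
print("latin-1 output matches input:", latin1_out == raw)

# ascii+surrogateescape: the same strict re-encoding fails loudly.
try:
    as_escaped.encode("utf-8")
    failed_loudly = False
except UnicodeEncodeError:
    failed_loudly = True
print("surrogateescape failed loudly:", failed_loudly)

# Encoding with the handler that produced the surrogates restores
# the original bytes exactly.
round_tripped = as_escaped.encode("ascii", "surrogateescape")
print("round trip matches input:", round_tripped == raw)
```

Both strategies accept the input without complaint. The difference only shows up when the text is written back out under a different encoding.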
OK, so concrete proposals: update the docs and maybe make a synonym for Latin-1 that makes it more semantically obvious that you're not really using it as Latin-1, just as an easy pass-through encoding. Anything else? Any bike shedding on the synonym?
I don't see any reason to obfuscate the use of "latin-1" as a workaround that maps 8-bit bytes directly to the corresponding Unicode code points.

My proposal would be two-fold:

Firstly, that we document three alternatives for working with arbitrary ASCII compatible encodings (from simplest to most flexible):

1. Use the "latin-1" encoding

   The latin-1 encoding accepts arbitrary binary data by mapping individual bytes directly to the first 256 Unicode code points. Thus, any sequence of bytes may be translated to a sequence of code points, effectively reproducing the behaviour of Python 2's 8-bit strings.

   If all data supplied is genuinely in an ASCII compatible encoding then this will work correctly. However, it fails badly if the supplied data is ever in an ASCII incompatible encoding, or if the decoded string is written back out using a different encoding. Using this option switches off *all* of Python 3's support for ensuring transcoding correctness - errors will frequently pass silently and result in corrupted output data rather than explicit exceptions.

2. Use the "ascii" encoding with the "surrogateescape" error handler

   This is the most correct approach that doesn't involve attempting to guess the string encoding. Behaviour if given data in an ASCII incompatible encoding is still unpredictable (and likely to result in data corruption). This approach retains most of Python 3's support for ensuring transcoding correctness, while still accepting any ASCII compatible encoding.

   If UnicodeEncodeErrors when displaying surrogate escaped strings are not desired, sys.stdout should also be updated to use the "backslashreplace" error handler. (see below)

3. Initially process the data as binary, using the "chardet" package from PyPI to guess the encoding

   This is the most correct option that can even cope with many ASCII incompatible encodings. Unfortunately, the chardet site is gone, since Mark Pilgrim took down his entire web presence.
   This (including the dead home page link from the PyPI entry) would need to be addressed before its use could be recommended in the official documentation (or, failing that, is there a properly documented alternative package available?)

Secondly, that we make it easy to replace a TextIOWrapper with an equivalent wrapper that has only selected settings changed (e.g. encoding or errors). In 3.2, that is currently not possible, since the original "newline" argument is not made available as a public attribute. The closest we can get is to force universal newlines mode along with whatever other changes we want to make:

    old = sys.stdout
    sys.stdout = io.TextIOWrapper(old.buffer, old.encoding, "backslashreplace", None, old.line_buffering)

3.3 currently makes this even worse by accepting a "write_through" argument that isn't available for introspection. I propose that we make it possible to write the above as:

    sys.stdout = sys.stdout.rewrap(errors="backslashreplace")

For the latter point, see http://bugs.python.org/issue14017

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
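
For reference, the workaround described above can be sketched as a standalone helper. The function name `rewrap` and its keyword interface are hypothetical here, mirroring the proposed method rather than any existing API, and the demonstration uses an in-memory buffer (an assumption) instead of the real sys.stdout so it is safe to run anywhere:

```python
import io

def rewrap(stream, **changes):
    """Hypothetical helper: build a new TextIOWrapper over the same
    binary buffer, copying the introspectable settings and overriding
    any passed in as keyword arguments."""
    settings = {
        "encoding": stream.encoding,
        "errors": stream.errors,
        # The original "newline" value is not a public attribute,
        # so universal newlines mode is forced, as described above.
        "newline": None,
        "line_buffering": stream.line_buffering,
    }
    settings.update(changes)
    stream.flush()  # push out anything still pending in the old wrapper
    return io.TextIOWrapper(stream.buffer, **settings)

# Demonstration on an in-memory buffer standing in for stdout.
buffer = io.BytesIO()
old = io.TextIOWrapper(buffer, encoding="ascii", errors="strict")
# Keep "old" referenced: finalising it would close the shared buffer.
new = rewrap(old, errors="backslashreplace")

# A surrogate-escaped string is now written as an escape sequence
# instead of raising UnicodeEncodeError.
new.write("caf\udce9")
new.flush()
print(buffer.getvalue())
```

As in the two-line original, the old wrapper has to stay alive for as long as the new one is in use. On Python 3.7 and later, `TextIOWrapper.reconfigure()` appears to cover much the same need without building a second wrapper.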