[Python-ideas] Python 3000 TIOBE -3%

Wed Feb 15 01:02:20 CET 2012

On Tue, Feb 14, 2012 at 9:39 PM, Carl M. Johnson
<cmjohnson.mailinglist at gmail.com> wrote:
>
> On Feb 13, 2012, at 10:45 PM, Nick Coghlan wrote:
>
>> (Of course, if sys.stdout.encoding is "UTF-8", then you're right, those
>> characters will just be displayed as gibberish, as they would in the
>> latin-1 case. I guess it's only on Windows and in any other locations
>> with a more restrictive default stdout encoding that errors are
>> particularly likely).
>
> I don't think that's right. I think that by default Python refuses to turn surrogate characters into UTF-8:

Oops, that's what I get for posting without testing :)

Still, your example clearly illustrates the point I was trying to make
- that using "ascii+surrogateescape" is less likely to silently
corrupt the data stream than using "latin-1", because attempts to
encode it under the "strict" error handler will generally fail, even
for an otherwise universal encoding like UTF-8.
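A minimal sketch of that difference, assuming (purely for illustration) an input that is really UTF-8 encoded and that gets written back out as UTF-8:

```python
data = b"caf\xc3\xa9"   # "cafe" with an accented e, encoded as UTF-8

# latin-1 accepts every byte, so writing the result back out as UTF-8
# silently double-encodes the two non-ASCII bytes.
as_latin1 = data.decode("latin-1")
corrupted = as_latin1.encode("utf-8")

# ascii+surrogateescape carries the non-ASCII bytes through as lone
# surrogates, which the strict error handler refuses to encode, even
# for UTF-8.
escaped = data.decode("ascii", errors="surrogateescape")
try:
    escaped.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The original bytes are only recovered by asking for the handler again.
restored = escaped.encode("utf-8", errors="surrogateescape")
```

In the latin-1 case nothing signals that `corrupted` no longer matches `data`; in the surrogateescape case the mismatch surfaces as an exception at the point of encoding.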

> OK, so concrete proposals: update the docs and maybe make a synonym for Latin-1 that makes it more semantically obvious that you're not really using it as Latin-1, just as an easy pass-through encoding. Anything else? Any bike shedding on the synonym?

I don't see any reason to obfuscate the use of "latin-1" as a
workaround that maps 8-bit bytes directly to the corresponding Unicode
code points. My proposal would be two-fold:

Firstly, that we document three alternatives for working with
arbitrary ASCII compatible encodings (from simplest to most flexible):

1. Use the "latin-1" encoding

The latin-1 encoding accepts arbitrary binary data by mapping
individual bytes directly to the first 256 Unicode code points. Thus,
any sequence of bytes may be translated to a sequence of code points,
effectively reproducing the behaviour of Python 2's 8-bit strings. If
all data supplied is genuinely in an ASCII compatible encoding then
this will work correctly. However, it fails badly if the supplied data
is ever in an ASCII incompatible encoding, or if the decoded string is
written back out using a different encoding. Using this option
switches off *all* of Python 3's support for ensuring transcoding
correctness - errors will frequently pass silently and result in
corrupted output data rather than explicit exceptions.
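A short sketch of both halves of that description. The UTF-16 input is only an assumed example of an ASCII incompatible encoding:

```python
# Every byte value maps to the code point with the same number, so any
# byte sequence decodes and round-trips losslessly.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))
assert text.encode("latin-1") == all_bytes

# ASCII incompatible input is accepted without complaint: each UTF-16
# code unit becomes two unrelated characters.
utf16_data = "hi".encode("utf-16-le")      # b'h\x00i\x00'
garbled = utf16_data.decode("latin-1")     # four characters, not two
```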

2. Use the "ascii" encoding with the "surrogateescape" error handler

This is the most correct approach that doesn't involve attempting to
guess the string encoding. Bytes outside the ASCII range are decoded
to lone surrogate code points (U+DC80 to U+DCFF) and are turned back
into the original bytes when encoded with the same error handler.
Behaviour if given data in an ASCII incompatible encoding is still
unpredictable (and likely to result in data corruption). This approach
retains most of Python 3's support for ensuring transcoding
correctness, while still accepting any ASCII compatible encoding.

If a UnicodeEncodeError when displaying surrogate escaped strings is
not desired, sys.stdout should also be switched to the
"backslashreplace" error handler (see below).
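A minimal sketch of option 2, assuming a made-up `name=value` input line whose value contains one non-ASCII byte in an unknown ASCII compatible encoding:

```python
data = b"name=caf\xe9\n"   # hypothetical input: ASCII syntax, unknown payload

text = data.decode("ascii", errors="surrogateescape")
key, _, value = text.partition("=")   # the ASCII parts are handled as text

# The escaped byte is restored exactly when the same codec and handler
# are used on the way out.
round_tripped = text.encode("ascii", errors="surrogateescape")

# For display, backslashreplace renders the lone surrogate as an escape
# sequence instead of raising UnicodeEncodeError.
printable = value.strip().encode("ascii", errors="backslashreplace").decode("ascii")
```

Here `printable` contains the escape sequence for the surrogate rather than the original character, which is the trade-off of displaying data whose encoding was never determined.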

3. Initially process the data as binary, using the "chardet" package
from PyPI to guess the encoding

This is the most correct option, and the only one that can cope with
many ASCII incompatible encodings. Unfortunately, the chardet site is gone, since
Mark Pilgrim took down his entire web presence. This (including the
dead home page link from the PyPI entry) would need to be addressed
before its use could be recommended in the official documentation (or,
failing that, is there a properly documented alternative package
available?)
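A rough sketch of option 3. `decode_with_guess` is a hypothetical helper, not part of chardet; the only chardet call used is `chardet.detect()`, and the import is guarded so the sketch falls back to option 2 when the package is not installed:

```python
try:
    import chardet          # third-party: the "chardet" package from PyPI
except ImportError:
    chardet = None


def decode_with_guess(data, fallback="ascii"):
    """Hypothetical helper: decode bytes using chardet's best guess."""
    guess = chardet.detect(data) if chardet is not None else {}
    encoding = guess.get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown guess: fall back to option 2 rather than fail.
        return data.decode("ascii", errors="surrogateescape")
```

The guess is a heuristic, so it is best made once over as much of the binary data as is available rather than line by line.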

Secondly, that we make it easy to replace a TextIOWrapper with an
equivalent wrapper that has only selected settings changed (e.g.
encoding or errors). In 3.2, that is currently not possible, since the
original "newline" argument is not made available as a public
attribute. The closest we can get is to force universal newlines mode
along with whatever other changes we want to make:

    import io
    import sys

    old = sys.stdout
    sys.stdout = io.TextIOWrapper(old.buffer, old.encoding,
                                  "backslashreplace", None, old.line_buffering)

3.3 currently makes this even worse by accepting a "write_through"
argument that isn't available for introspection.

I propose that we make it possible to write the above as:

    sys.stdout = sys.stdout.rewrap(errors="backslashreplace")

For the latter point, see http://bugs.python.org/issue14017
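Until something like that exists, the workaround can be packaged as a helper. This is only a sketch: `rewrap` here is a hypothetical free function, not an existing TextIOWrapper method, and it inherits the limitation described above, namely that the original "newline" setting cannot be read back and so has to be passed in again (defaulting to universal newlines):

```python
import io


def rewrap(stream, *, encoding=None, errors=None, newline=None,
           line_buffering=None):
    """Hypothetical helper: wrap stream.buffer again, changing only the
    settings that are passed explicitly."""
    stream.flush()  # don't leave pending text in the old wrapper
    return io.TextIOWrapper(
        stream.buffer,
        encoding=stream.encoding if encoding is None else encoding,
        errors=stream.errors if errors is None else errors,
        newline=newline,  # not introspectable on the old wrapper
        line_buffering=(stream.line_buffering if line_buffering is None
                        else line_buffering),
    )


# Demonstration on an in-memory stream rather than the real sys.stdout.
raw = io.BytesIO()
old = io.TextIOWrapper(raw, encoding="ascii", errors="strict")
new = rewrap(old, errors="backslashreplace")
new.write("caf\xe9")
new.flush()
```

With this helper the stdout case would read `sys.stdout = rewrap(sys.stdout, errors="backslashreplace")`. The old wrapper should stay referenced (as `sys.__stdout__` does for the real stdout), since both wrappers share one underlying buffer.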

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia