[Python-ideas] Python 3000 TIOBE -3%

Wed Feb 15 09:03:03 CET 2012

On Wed, Feb 15, 2012 at 2:12 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Nick Coghlan writes:
>
>  > using "ascii+surrogateescape" for your own I/O and setting
>  > "backslashreplace" on sys.stdout should cover you (and any
>  > exceptions you get will be warning you about cases where your
>  > original assumptions about not caring about Unicode validity have
>  > been proven wrong).
>
> Are you saying you know more than the user about her application?

No, I'm merely saying that at least 3 options (latin-1,
ascii+surrogateescape, chardet2) should be presented clearly to
beginners and the trade-offs explained.

For example:

Task: Process data in any ASCII-compatible encoding
Unicode Awareness Care Factor: None
Approach: Specify encoding="latin-1"
    Bytes/bytearray: data.decode("latin-1")
    Text files: open(fname, encoding="latin-1")
    Stdin replacement: sys.stdin = io.TextIOWrapper(sys.stdin.buffer, "latin-1")
    Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(
        sys.stdout.buffer, "latin-1", line_buffering=True)
    Stdout replacement (terminal): Leave it alone

By decoding with latin-1, an application won't get *any* Unicode
decoding errors, as that encoding maps byte values directly to the
first 256 Unicode code points. However, any output data generated by
that application *will* be corrupted if the assumption of ASCII
compatibility is violated, or if implicit transcoding to any encoding
other than "latin-1" occurs (e.g. when writing to sys.stdout or a log
file, communicating over a network socket or serialising the string
with the json module). This is the closest Python 3 comes to emulating the
permissive behaviour of Python 2's 8-bit strings (implicit
interoperation with byte sequences is still disallowed).
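
As a rough illustration of that trade-off (the sample bytes and variable
names here are made up for the example, not taken from any real
application), the following sketch shows latin-1 decoding never failing,
round-tripping cleanly, and then silently corrupting the data as soon as
it is transcoded to something else:

```python
# Hypothetical sample data: UTF-8 bytes that we pretend not to know the
# encoding of.
raw = "name=café\n".encode("utf-8")

# Decoding as latin-1 cannot fail: every byte maps to one code point.
text = raw.decode("latin-1")

# ASCII-level processing works without knowing the real encoding.
key, _, value = text.rstrip("\n").partition("=")

# Writing back out as latin-1 restores the original bytes exactly.
roundtrip = text.encode("latin-1")

# Transcoding to any other encoding raises no error, but the result is
# no longer the original data (classic mojibake).
corrupted = text.encode("utf-8")

print(repr(value))         # 'cafÃ©'
print(roundtrip == raw)    # True
print(corrupted == raw)    # False
```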

Task: Process data in any ASCII-compatible encoding
Unicode Awareness Care Factor: Minimal
Approach: Use encoding="ascii" and errors="surrogateescape" (or,
alternatively, errors="backslashreplace" for sys.stdout)
    Bytes/bytearray: data.decode("ascii", errors="surrogateescape")
    Text files: open(fname, encoding="ascii", errors="surrogateescape")
    Stdin replacement: sys.stdin = io.TextIOWrapper(
        sys.stdin.buffer, "ascii", "surrogateescape")
    Stdout replacement (pipeline): sys.stdout = io.TextIOWrapper(
        sys.stdout.buffer, "ascii", "surrogateescape",
        line_buffering=True)
    Stdout replacement (terminal): sys.stdout = io.TextIOWrapper(
        sys.stdout.buffer, sys.stdout.encoding, "backslashreplace",
        line_buffering=True)

Using "ascii+surrogateescape" instead of "latin-1" is a small initial
step into the Unicode-aware world. It still lets an application
process any ASCII-compatible encoding *without* having to know the
exact encoding of the source data, but will complain if there is an
implicit attempt to transcode the data to another encoding, or if the
application inserts non-ASCII data into the strings before writing
them out. Whether encodings that are not ASCII-compatible trigger
errors or get corrupted will depend on the specifics of the encoding
and how the program manipulates the data.
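
A minimal sketch of that behaviour, again with made-up sample bytes and
variable names: the data round-trips unchanged, but both an implicit
transcoding step and the insertion of genuine non-ASCII text are
reported as errors instead of producing corrupted output.

```python
# Hypothetical sample data in some unknown ASCII-compatible encoding.
raw = "name=café\n".encode("utf-8")

# Non-ASCII bytes are smuggled through as lone surrogates
# (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
text = raw.decode("ascii", errors="surrogateescape")
key, _, value = text.rstrip("\n").partition("=")

# Encoding with the same codec and error handler restores the bytes.
roundtrip = text.encode("ascii", errors="surrogateescape")

# Transcoding to a different encoding is rejected, not silently mangled.
try:
    text.encode("utf-8")
    transcoding_failed = False
except UnicodeEncodeError:
    transcoding_failed = True

# So is writing out real non-ASCII text added by the application.
try:
    (text + "é").encode("ascii", errors="surrogateescape")
    insertion_failed = False
except UnicodeEncodeError:
    insertion_failed = True

print(roundtrip == raw, transcoding_failed, insertion_failed)
```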

The "backslashreplace" error handler (enabled by default for
sys.stderr, optionally enabled as shown above for sys.stdout) can be
useful to help ensure that printing out strings will not trigger
UnicodeEncodeErrors (note: the *repr* of a string already escapes any
lone surrogates produced by "surrogateescape", so for strings decoded
as shown above repr(x) == ascii(x). Thus, UnicodeEncodeErrors will
occur only when encoding the string itself using the "strict" error
handler, or when another library performs equivalent validation on the
string).
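
To make the effect concrete, here is a small sketch that uses an
in-memory buffer as a stand-in for sys.stdout.buffer (the buffer, the
sample text and the variable names are assumptions for illustration
only). The "strict" handler fails, while the "backslashreplace" wrapper
writes escaped output:

```python
import io

# Hypothetical sample: one byte smuggled in as a surrogate, plus a
# genuine non-ASCII character added by the application.
escaped = b"caf\xe9".decode("ascii", errors="surrogateescape")
text = escaped + " \u2603"

# The default "strict" handler refuses to encode this string.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Stand-in for sys.stdout.buffer so the written bytes can be inspected.
# newline="\n" keeps the output identical across platforms.
fake_stdout_buffer = io.BytesIO()
out = io.TextIOWrapper(fake_stdout_buffer, "ascii", "backslashreplace",
                       newline="\n", line_buffering=True)
print(text, file=out)
out.flush()
written = fake_stdout_buffer.getvalue()

print(strict_failed)   # True
print(written)         # b'caf\\udce9 \\u2603\n'
```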

Task: Process data in any ASCII-compatible encoding
Unicode Awareness Care Factor: High
Approach: Use binary APIs and the "chardet2" module from PyPI to
detect the character encoding
    Bytes/bytearray: data.decode(detected_encoding)
    Text files: open(fname, encoding=detected_encoding)

The *right* way to process text in an unknown encoding is to do your
best to derive the encoding from the data stream. The "chardet2"
module on PyPI allows this. Refer to that module's documentation
(WHERE?) for details.

With this approach, transcoding to the default sys.stdin and
sys.stdout encodings should generally work (although the default
restrictive character set on Windows and in some locales may cause
problems).
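
The overall shape of the detect-then-decode workflow might look
something like the sketch below. Note that guess_encoding() is a
hypothetical, deliberately naive stand-in written for this example (it
just tries a fixed list of candidate encodings in order); it is *not*
the chardet2 API, and a real detector would use statistical analysis of
the byte stream instead.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252")):
    """Hypothetical stand-in for a real detector.

    Returns the first candidate encoding that decodes the data without
    errors, or None if none of them do.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample data read via a binary API.
raw = "name=café\n".encode("utf-8")

detected_encoding = guess_encoding(raw)
if detected_encoding is None:
    raise ValueError("could not determine the encoding")

# Once the encoding is known, the text is ordinary Unicode and can be
# transcoded to whatever the output side requires.
text = raw.decode(detected_encoding)
transcoded = text.encode("utf-16")

print(detected_encoding)   # utf-8
```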

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia