[Python-ideas] Python 3000 TIOBE -3%

Stephen J. Turnbull stephen at xemacs.org
Wed Feb 15 19:40:25 CET 2012


It seems we once again agree violently on the principles.  I think our
differences here are mostly due to me giving a lot of attention to
audience and presentation, and you focusing on the content of what to say.

Re: spin control:

Nick Coghlan writes:

 > No, I'm merely saying that at least 3 options (latin-1,
 > ascii+surrogateescape, chardet2) should be presented clearly to
 > beginners and the trade-offs explained.

Are you defining "beginner" as "Python 2 programmer experienced in a
multilingual context but new to Python 3"?

My point is that, by other definitions of "beginner", I don't think
the tradeoffs can be usefully explained to beginners without
substantial discussion of the issues involved in ASCII vs. the
encoding Babel vs. Unicode.  Only in the extreme cases, where the
beginner cares solely about *never* getting a Unicode error or solely
about *never* getting mojibake, will they be able to get much out of
this.

Re: descriptions

 > Task: Process data in any ASCII compatible encoding
 > Unicode Awareness Care Factor: None

I don't understand what "Unicode awareness" means here.  The degree to
which Python will raise Unicode errors?  The awareness of the programmer?

 > Approach: Specify encoding="latin-1"
[...]
 > first 256 Unicode code points. However, any output data generated by
 > that application *will* be corrupted

As advice, I think this is mostly false.  In particular, unless you do
language-specific manipulations (transforming particular words and the
like), the Latin-N family is going to be 6-sigma interoperable with
Latin-1, and the rest of the ISO 8859 and Windows-125x families
tolerably so.  This is why it is so hard to root out the "Python 3 is
just Unicode-me-harder by another name" meme.  The most you should say
here is that data *may* be corrupted and that, depending on the
program, the risk *may* be non-negligible for non-Latin-1 data if you
ever encounter it.
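
The reason the risk is usually negligible: "latin-1" maps each of the
256 byte values to the code point with the same number, so a
decode/encode round trip preserves any byte stream exactly, whatever
its true encoding.  A minimal sketch (the byte values are
illustrative):

    raw = b"\xf0\xd2\xc9\xd7\xc5\xd4"     # non-ASCII bytes, true encoding unknown
    text = raw.decode("latin-1")          # cannot fail: every byte maps to a code point
    assert text.encode("latin-1") == raw  # byte-for-byte round trip

    # Corruption enters only when the program *interprets* the
    # characters, e.g. case-mapping them as if they really were
    # Latin-1 letters:
    assert text.upper().encode("latin-1") != raw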

 > Using "ascii+surrogateescape" instead of "latin-1" is a small initial
 > step into the Unicode-aware world. It still lets an application
 > process any ASCII-compatible encoding *without* having to know the
 > exact encoding of the source data, but will complain if there is an
 > implicit attempt to transcode the data to another encoding,

That last line would be better "attempt to validate the data, or
output it without an error-suppressing handler (which may occur
implicitly, in a module your program uses)."
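
Concretely, the errors surface like this (a sketch; the byte values
are illustrative):

    raw = b"caf\xe9"   # ASCII-compatible data, exact encoding unknown
    text = raw.decode("ascii", errors="surrogateescape")

    # Round-tripping with the same handler is lossless:
    assert text.encode("ascii", errors="surrogateescape") == raw

    # But validating or transcoding the smuggled bytes fails, even
    # when a module your program uses does it implicitly:
    try:
        text.encode("utf-8")        # e.g. a library writing the data out
    except UnicodeEncodeError:
        pass                        # this is the error the user will see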

 > or if the application inserts non-ASCII data into the strings
 > before writing them out. Whether non-ASCII compatible encodings
 > trigger errors or get corrupted will depend on the specifics of the
 > encoding and how the program manipulates the data.

You can be a little more precise: non-ASCII-compatible encodings will
trigger errors in the same circumstances as ASCII-compatible
encodings.  They are also likely to be corrupted, with the details
depending on the specifics of the encoding and how the program
manipulates the data.  I don't know if it's worth the extra
verbosity, though.

 > Task: Process data in any ASCII compatible encoding
 > Unicode Awareness Care Factor: High
 > Approach: Use binary APIs and the "chardet2" module from PyPI to
 > detect the character encoding
 >     Bytes/bytearray: data.decode(detected_encoding)
 >     Text files: open(fname, encoding=detected_encoding)
 > 
 > The *right* way to process text in an unknown encoding is to do your
 > best to derive the encoding from the data stream.

The claim of "right" isn't good advice.  The *right* way to process
text is to insist on knowing the encoding in advance.  If you have to
process text in unknown encodings, then what is "right" will vary with
the application.

For one thing, accurate detection is generally impossible without
advice from outside.  Given the inaccuracy of automatic detection, I
would often prefer to fall back to a generic ASCII-compatible
algorithm that omits
any processing that requires identifying non-ASCII characters or
inserting non-ASCII characters into the text stream, rather than risk
mojibake.
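
Such a fallback might look like this sketch, written against the
detect() API of the chardet family of detectors (whether chardet2
spells it exactly the same way, I haven't checked; the 0.8 confidence
cutoff is an arbitrary illustration):

    import chardet

    def open_text(fname, min_confidence=0.8):
        """Open a text file, trusting detection only when it is confident."""
        with open(fname, "rb") as f:
            guess = chardet.detect(f.read())
        if guess["encoding"] and guess["confidence"] >= min_confidence:
            return open(fname, encoding=guess["encoding"])
        # Generic ASCII-compatible fallback: no mojibake risk, as long
        # as the escaped non-ASCII bytes are never interpreted.
        return open(fname, encoding="ascii", errors="surrogateescape")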

In other cases, all of the significant processing is done on ASCII
characters, and non-ASCII is simply passed through verbatim.  Then if
you need to process text in assorted encodings, the 'latin1' method is
not merely acceptable, it is the obvious winning strategy.
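
For instance, a filter like this sketch (the colon-delimited format
is made up) manipulates only the ASCII delimiters, so "latin-1"
passes every non-ASCII byte through untouched:

    def swap_first_two_fields(line):
        # All manipulation is on the ASCII delimiter ":"; non-ASCII
        # field contents pass through, whatever their encoding is.
        fields = line.split(":")
        fields[0], fields[1] = fields[1], fields[0]
        return ":".join(fields)

    raw = b"nom:caf\xe9 cr\xe8me:3"           # illustrative bytes, encoding unknown
    out = swap_first_two_fields(raw.decode("latin-1")).encode("latin-1")
    assert out == b"caf\xe9 cr\xe8me:nom:3"   # bytes preserved, just reordered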

And, to some extent, with the environment:

 > [T]he default restrictive character set on Windows and in some
 > locales may cause problems.

In sum, naive use of chardet is most likely to be effective as a way
to rule out non-ASCII-compatible encodings, which *can* be done rather
accurately (Shift JIS, Big5, UTF-16, and UTF-32 all have
characteristic patterns of use of non-ASCII octets).
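
That kind of screening can be done with quite crude heuristics.  A
sketch using two of the characteristic patterns (BOMs, and the NUL
bytes that UTF-16/UTF-32 scatter through mostly-ASCII text); Shift
JIS and Big5 need byte-pair pattern tests rather than a simple
screen, so I leave them out:

    import codecs

    def looks_ascii_compatible(data):
        # BOMs identify the UTF-16/UTF-32 families outright (the
        # UTF-32 BOMs must be tested before their UTF-16 prefixes).
        for bom in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
                    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
            if data.startswith(bom):
                return False
        # NUL bytes never occur in ASCII-compatible text, but pervade
        # mostly-ASCII text in the UTF-16/UTF-32 encodings.
        return b"\x00" not in data

    assert not looks_ascii_compatible(codecs.BOM_UTF16_LE + "abc".encode("utf-16-le"))
    assert looks_ascii_compatible(b"plain ascii-compatible caf\xe9")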


