[Python-ideas] Python 3000 TIOBE -3%

Paul Moore p.f.moore at gmail.com
Wed Feb 15 19:51:29 CET 2012


I really like a task-oriented approach like this. +1000 for this sort
of thing in the docs.

On 15 February 2012 08:03, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Task: Process data in any ASCII compatible encoding

This is actually closest to how I think about what I'm doing, so
thanks for spelling it out.

> Unicode Awareness Care Factor: High

I'm not entirely sure how to interpret this - "High level of interest
in getting it right" or "High amount of investment in understanding
Unicode needed"? Or something else?

> Approach: Use binary APIs and the "chardet2" module from PyPI to
> detect the character encoding
>    Bytes/bytearray: data.decode(detected_encoding)
>    Text files: open(fname, encoding=detected_encoding)

If this is going into the Unicode FAQ or somewhere similar, it
probably needs a more complete snippet of sample code. Not having
read the chardet2 documentation, I assume I need to read the file
once in binary mode (possibly only partially) to scan it for an
encoding, and then start again "for real"? That's arguably a downside
to this approach.
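
Something along these lines is what I'd expect (untested, and I'm
guessing that chardet2 exposes a detect() function returning a dict
with an 'encoding' key, the way the original chardet does):

    import chardet2

    def open_with_detected_encoding(fname, sample_size=64 * 1024):
        # First pass: read a chunk of raw bytes purely for detection.
        with open(fname, 'rb') as f:
            sample = f.read(sample_size)
        guess = chardet2.detect(sample)  # e.g. {'encoding': 'utf-8', ...}
        # Second pass: start again "for real", now in text mode.
        return open(fname, encoding=guess['encoding'])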

> The *right* way to process text in an unknown encoding is to do your
> best to derive the encoding from the data stream. The "chardet2"
> module on PyPI allows this. Refer to that module's documentation
> (WHERE?) for details.

There is arguably another, simpler approach: pick a default encoding
(probably whatever Python gives you by default) and add a command
line option (or the equivalent, if your program isn't a command line
app) that lets the user manually specify an alternative. That's
probably more complicated than the naive user wanted to deal with
when they started reading this summary, but it may well not sound so
bad by the time they get to this point :-)
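
In code it's barely anything; a sketch (locale.getpreferredencoding()
is roughly what open() defaults to anyway):

    import argparse
    import locale

    parser = argparse.ArgumentParser(description='Process a text file.')
    parser.add_argument('filename')
    parser.add_argument('--encoding',
                        default=locale.getpreferredencoding(),
                        help='encoding of the input file '
                             '(default: the locale default)')
    args = parser.parse_args()

    with open(args.filename, encoding=args.encoding) as f:
        for line in f:
            pass  # process the (now correctly decoded) text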

> With this approach, transcoding to the default sys.stdin and
> sys.stdout encodings should generally work (although the default
> restrictive character set on Windows and in some locales may cause
> problems).

A couple of other tasks spring to mind:

Task: Process data in a file whose encoding I don't know
Unicode Understanding Needed: Medium-Low
Unicode Correctness: High
Approach: Use external tools to identify the encoding, then simply
specify it when opening the file. On Unix, "file -i FILENAME" will
attempt to detect the encoding; on Windows, XXX. If, and only if,
this approach doesn't identify the encoding clearly, fall back to the
other options and do the best you can.

(Needs a better description of what tools to use, and maybe a sample
Python script using chardet2 as a fallback).
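
For the fallback part, a rough sketch (untested; again assuming a
chardet-style detect() function, and GNU file's --mime-encoding
option, which may be spelled differently on other platforms):

    import subprocess
    import chardet2

    def guess_encoding(fname):
        # Ask the external "file" tool first.
        try:
            out = subprocess.check_output(['file', '--mime-encoding',
                                           fname])
            encoding = out.decode('ascii').rsplit(':', 1)[1].strip()
            if encoding not in ('unknown-8bit', 'binary'):
                return encoding
        except (OSError, subprocess.CalledProcessError):
            pass
        # Fall back to guessing from the data itself.
        with open(fname, 'rb') as f:
            return chardet2.detect(f.read())['encoding']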

This is actually the "right way", and should be highlighted as such.
Describing it this way also makes it clear that it's *not hard*, once
you get over the idea that the encoding is unknowable just because it
isn't recorded in the file.

Having read through and extended Nick's analysis to this point, I'm
thinking that it actually fits my use cases fine (and correct Unicode
handling no longer feels like such a hard problem to me :-))

Task: Process data in a file believed to have inconsistent encodings
Unicode Understanding Needed: High
Unicode Correctness: Low
Approach: ??? Panic :-)

This is the killer, but should be extremely rare. We don't need to
explain what to do here, but maybe offer a simple strategy:

1. Are you sure the file has mixed encodings? Have you checked twice?
2. If it's ASCII-compatible, can you get away with passing the
   mixed-encoding bytes through unchanged? If so, use one of the
   other recipes Nick explained (see the sketch below).
3. Do you care about mojibake or corruption? Can you afford not to?
4. Are you a Unicode expert, or do you know one? :-)
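
For point 2, I assume the relevant recipe is the ascii +
surrogateescape trick; a sketch (the file names are made up):

    # Treat the file as ASCII, smuggling any non-ASCII bytes through
    # unchanged via the surrogateescape error handler. Only safe if
    # the processing never touches the non-ASCII parts.
    with open('mixed.txt', encoding='ascii',
              errors='surrogateescape') as src:
        with open('clean.txt', 'w', encoding='ascii',
                  errors='surrogateescape') as dst:
            for line in src:
                dst.write(line.replace('\t', '    '))  # ASCII-only edit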

I think something like this would be a huge benefit for the Unicode
FAQ. I haven't got the time or expertise to write it, but I wish I
did. If I get some spare time, I might well have a go anyway, but I
can't promise.

Paul


