[Python-Dev] Python3 "complexity" - 2 use cases

Fri Jan 10 22:23:18 CET 2014

> Steven D'Aprano wrote:
>> I think that heuristics to guess the encoding have their role to play,
>> if the caller understands the risks.

Ben Finney wrote:
> In my opinion, content-type guessing heuristics certainly don't belong
> in the standard library.

It would be great if there were never any need to guess.  But in the
real world, there is -- and often the user won't know any more than
Python does.  So when it is time to guess, a source of good guesses
is an important battery to include.

The HTML5 specifications go through some fairly extreme contortions
to document what browsers actually do, as opposed to what previous
standards have mandated.  They don't currently specify how to guess
(though I think a draft once tried, since the major browsers all do
it, and at the time did it similarly), but the specs do explicitly
support such a step, and do provide an implementation note
encouraging user-agents to do at least minimal auto-detection.  

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

My own opinion is therefore that Python SHOULD provide better support
for both of the following use cases:

    (1)  Treat this file like it came from the web -- including
         autodetection and even overriding explicit charset
         declarations for certain charsets.

    We should explicitly treat autodetection like time zone data --
    there is no promise that the "right answer" (or at least the
    "best guess") won't change, even within a release.

    I offer no opinion on whether chardet in particular is still
    too volatile, but the docs should warn that the API is driven
    by possibly changing external data.
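    As a rough illustration of use case (1), here is a minimal sketch
    of "web-style" guessing.  The helper name guess_encoding, its
    parameters, and the order of the checks are my assumptions for
    illustration; this is not the HTML5 algorithm and not chardet.
    It checks for a BOM, then applies the browser habit of treating
    a declared iso-8859-1 or us-ascii as windows-1252, then tries
    UTF-8 and falls back to windows-1252.

```python
import codecs

# BOM signatures; UTF-32 comes first because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Declared labels that browsers treat as windows-1252 anyway
# (a deliberately tiny, illustrative subset).
_WEB_OVERRIDES = {
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
}


def guess_encoding(data, declared=None, fallback="windows-1252"):
    """Return a best-guess codec name for the bytes in `data`.

    Hypothetical helper: the result is a guess, and the rules
    behind it may reasonably change over time.
    """
    # 1. A byte order mark wins over everything else.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. An explicit declaration is used, except for overridden labels.
    if declared is not None:
        label = declared.strip().lower()
        return _WEB_OVERRIDES.get(label, label)
    # 3. No declaration: accept UTF-8 if the bytes are valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

    A caller would then do data.decode(guess_encoding(data)), with
    the understanding that the guess can be wrong.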

    (2)  Treat this file as "ASCII+", where anything non-ASCII
         will (at most) be written back out unchanged; it doesn't
         even need to be converted to text.

    At this time, I don't know whether the right answer is making it
    easy to default to the surrogateescape error handler everywhere,
    adding more bytes methods, encouraging use of Python's latin-1
    variant, offering a dedicated (new?) codec, or some new suggestion.

    I do know that this use case is important, and that Python 3
    currently looks clumsy compared to Python 2.
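    To make use case (2) concrete, here is a minimal sketch of two
    of the options above as they exist in Python 3 today: decoding
    as ASCII with the surrogateescape error handler, and decoding as
    latin-1.  The function name and the sample edit are hypothetical;
    the point is only that non-ASCII bytes come back out unchanged.

```python
def replace_ascii_token(raw, old, new):
    """Edit the ASCII parts of `raw`, leaving other bytes untouched.

    Hypothetical helper.  Non-ASCII bytes are smuggled through the
    str as lone surrogates and restored on the way back out.
    """
    text = raw.decode("ascii", errors="surrogateescape")
    text = text.replace(old, new)
    return text.encode("ascii", errors="surrogateescape")


# Sample input: ASCII key/value text followed by bytes in an
# unknown encoding.
raw = b"name=foo \xff\xe9\n"
edited = replace_ascii_token(raw, "foo", "bar")

# Both approaches round-trip every possible byte value.
all_bytes = bytes(range(256))
via_surrogates = all_bytes.decode("ascii", "surrogateescape").encode(
    "ascii", "surrogateescape"
)
via_latin1 = all_bytes.decode("latin-1").encode("latin-1")
```

    The difference is that latin-1 turns the unknown bytes into
    ordinary-looking characters (which may be the wrong ones), while
    surrogateescape marks them as undecoded, so writing them out
    through a handler that is not surrogateescape raises an error.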

-jJ
