[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Tue Jun 11 21:26:26 CEST 2013

On Sun, Jun 9, 2013 at 3:59 PM, Yuval Greenfield <ubershmekel at gmail.com> wrote:
> On Sun, Jun 9, 2013 at 7:18 PM, Stephen J. Turnbull <stephen at xemacs.org>
> wrote:
>>
>> Yuval Greenfield writes:

>>  > Personally I favor the first because more often than not files
>>  > aren't encoded in the platform's chosen encoding, so it's better to
>>  > be explicit and consistent.

I'm guessing that the exceptions fit into two categories:

(1)  They came from some other system, likely as a saved web page.
or
(2)  They were written by program X, which ignored the system default
in favor of a "better" explicit choice.  Which might actually be
better, so long as you don't try to use them outside of that program.

> Hebrew compatibility has been the nuisance and these are the encodings I had
> to fight:

> utf-8, ucs-2, utf-16, ucs-4, ISO-8859-8, ISO-8859-8-I, Windows-1255.

> It's plagued websites, browsers, email clients,

Unfortunately, even an explicitly declared language and character set
is likely to be false.  Best results were obtained by (sometimes)
ignoring or overriding the explicit declarations, but the precise
details of how to do this changed over time.  The last time I looked
it up in the HTML5 draft, there were explicit "browser-specific
heuristics" steps.  Today, most of the encoding determination has
been split off into its own standard
(http://encoding.spec.whatwg.org/ -- last updated in February 2013),
which warns that:

   "In violation of section 1.4 of Unicode Technical Standard #22 this is a
    much simpler and more restrictive matching algorithm, as that is found to
    be necessary to be compatible with deployed content."
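
To make "simpler and more restrictive" concrete, here is a rough sketch
of that kind of label matching.  It assumes the algorithm is: trim ASCII
whitespace, then do an ASCII case-insensitive lookup in a fixed table.
The table below is only a small illustrative subset that I picked for
this thread, and the names LABELS and get_encoding are mine, not the
standard's.

```python
import string

# Illustrative subset only; the real label table is much longer.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "iso-8859-8": "iso-8859-8",
    "visual": "iso-8859-8",
    "iso-8859-8-i": "iso-8859-8-i",
    "logical": "iso-8859-8-i",
    "windows-1255": "windows-1255",
    "cp1255": "windows-1255",
}

ASCII_WHITESPACE = "\t\n\f\r "

# Lowercase A-Z only, so non-ASCII characters are never folded.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def get_encoding(label):
    """Return the encoding name for a label, or None if it is not listed."""
    key = label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABELS.get(key)
```

With this table, get_encoding("  UTF-8 ") gives "utf-8", while
get_encoding("utf_8") gives None.  A looser matcher in the style of
UTS #22, which ignores punctuation, would accept both.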

The main HTML standard does still define how to parse a meta charset
element (because that is internal to a document) at
http://www.w3.org/html/wg/drafts/html/master/infrastructure.html#extracting-character-encodings-from-meta-elements
but explicitly warns that this is slightly different from even the
HTTP standard.

ISO-8859-8 and ISO-8859-8-I even get a special mention.

Which is a long-winded way of saying "Sometimes the encoding will be
wrong."  If you could enforce utf-8, you would be fine -- but if you
could do that, then it would already have been the system default.
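
In code, "sometimes the encoding will be wrong" usually means treating
the declared encoding as a hint rather than a fact.  A minimal sketch,
assuming a fallback order that I made up for Hebrew text (the function
name and the default fallbacks are hypothetical, not a recommendation):

```python
def decode_with_fallback(data, declared=None, fallbacks=("utf-8", "cp1255")):
    """Try the declared encoding first, then each fallback in order.

    Returns (text, encoding_used).  encoding_used is None when nothing
    decoded cleanly and replacement characters had to be inserted.
    """
    candidates = [declared] if declared else []
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep what we can and mark the damage.
    return data.decode("utf-8", errors="replace"), None
```

Note the limits: a successful decode only proves that the bytes are
valid in that encoding, not that the guess was right.  Single-byte
encodings accept almost anything, so they belong at the end of the
list.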

> So I appreciate an app being consistent and promoting utf-8 more than being
> compliant with the operating system, which the apps I've used don't comply
> with.

So go ahead and explicitly use utf-8 when writing a file, and then use
it again when reading.  And the explicit use will advertise the name
"utf-8" as a good thing.
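
Concretely, that is just the encoding argument to open() on both
sides.  A minimal sketch (the helper names and the file name are only
for illustration):

```python
import os
import tempfile


def write_text(path, text):
    # Name the encoding instead of relying on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    # Use the same explicit encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "shalom.txt")
write_text(path, "שלום\n")
assert read_text(path) == "שלום\n"
```

The round trip gives the same string back on any platform, whatever
the locale's default encoding happens to be.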

> Another related annoyance I struggled with recently was that git gives you
> the platform's newline scheme, which means I can't have a git repository in
> a dropbox shared between Windows and Ubuntu without meddling with this stuff
> (the solution is a repo config file).

Would you rather have spurious changes as the newline convention went
back and forth, depending on who edited it last?
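
To show what I mean by spurious changes, here is a small sketch that
compares a file byte-for-byte before and after nothing but its newline
convention changed.  The function names are made up for the example,
and this is only an illustration of the effect, not of what git does
internally.

```python
def changed_lines(old, new):
    """Count lines whose bytes differ, line endings included."""
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    return sum(1 for a, b in zip(old_lines, new_lines) if a != b)


def normalize(data):
    # Store one convention (LF here) regardless of who edited last.
    return data.replace(b"\r\n", b"\n")


lf = b"one\ntwo\nthree\n"
crlf = lf.replace(b"\n", b"\r\n")

# Every line looks modified although no text changed.
assert changed_lines(lf, crlf) == 3
# After normalizing, the two versions are identical.
assert changed_lines(normalize(lf), normalize(crlf)) == 0
```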

-jJ