[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Yuval Greenfield ubershmekel at gmail.com
Sun Jun 9 21:59:00 CEST 2013


On Sun, Jun 9, 2013 at 7:18 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> Yuval Greenfield writes:
>
>  > Personally I favor the first because more often than not files
>  > aren't encoded in the platform's chosen encoding, so it's better to
>  > be explicit and consistent.
>
> I've been doing development of multilingual and multiscript software
> for two decades.  As much as I'd like to agree with you, in my
> experience you're wrong by a large factor where it matters: text files
> where there's an issue of "guessing" the encoding.  Those are far more
> often than not encoded in the platform's default encoding.
>
>
I'm always glad to learn, and to agree to disagree. I've only spent 8 years
in the "development of multilingual and multiscript software". Living in
Israel, Hebrew compatibility has been the main nuisance, and these are the
encodings I've had to fight:

utf-8, ucs-2, utf-16, ucs-4, ISO-8859-8, ISO-8859-8-I, Windows-1255.

They've plagued websites, browsers, email clients, Adobe Photoshop and
Premiere, Excel, Word, and PowerPoint. It's always been a guessing game
when a friend would call for help proclaiming "all I'm getting is Chinese",
which is the local euphemism for written gibberish. Sometimes it's
just the word or letter ordering that's messed up (Hebrew is an RTL
language). Most Israelis have experienced and fear this phenomenon.

If I were to try and fix a problem I'd either be using Notepad with its
heuristics or iterating through the above options.
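
As a rough illustration of "iterating through the above options", here is a
minimal Python 3 sketch. The helper name try_decodings and the candidate
order are my own assumptions. Python has no codecs named ucs-2 or ucs-4, so
utf-16 and utf-32 stand in for them here, and iso8859_8 stands in for both
ISO-8859-8 variants.

```python
# Candidate codecs, loosely following the list above (order is an assumption).
CANDIDATES = ["utf-8", "utf-16", "utf-32", "cp1255", "iso8859_8"]


def try_decodings(data, candidates=CANDIDATES):
    """Return (codec name, decoded text) for every candidate that decodes cleanly."""
    results = []
    for name in candidates:
        try:
            results.append((name, data.decode(name)))
        except UnicodeDecodeError:
            # This codec rejects the bytes outright, so rule it out.
            continue
    return results


# "shalom" in Hebrew, encoded as Windows-1255 for the demonstration.
sample = "\u05e9\u05dc\u05d5\u05dd".encode("cp1255")
for name, text in try_decodings(sample):
    print(name, ascii(text))
```

More than one codec can accept the same bytes without error, so a clean
decode only narrows the options; someone still has to look at the result and
decide which one is not gibberish.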

Sometimes the above encodings were the platform's (Windows') default
encoding, but in my experience it was mainly applications or websites that
chose their encoding for whatever reasons. E.g. Windows Internals 4th
edition promoted ucs-2 as the killer encoding that all Windows applications
should be implemented with. Yet I remember a friend's VBScript
spawning a ucs-4 csv file that turned into Chinese when opened in Excel. I
did not check which of those, if any, was the system default encoding at
the time.

So I appreciate an app being consistent and promoting utf-8 more than an app
complying with the operating system's default, which the apps I've used
don't comply with anyway.
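
To make the "explicit and consistent" option concrete, here is a small
Python 3 sketch. The file name demo.txt and the temporary directory are
placeholders of mine. It writes and reads a file with encoding="utf-8"
spelled out, then prints the locale encoding that open() would otherwise
fall back on when no encoding is given.

```python
import locale
import os
import tempfile

text = "\u05e9\u05dc\u05d5\u05dd"  # "shalom" in Hebrew

# Placeholder path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Explicit encoding: the bytes on disk are the same on every platform.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)

# Without encoding=..., open() uses the locale's preferred encoding,
# which differs from machine to machine.
print("platform default:", locale.getpreferredencoding(False))
```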

Another related annoyance I struggled with recently was that git gives you
the platform's newline scheme, which means I can't keep a git repository in
a Dropbox folder shared between Windows and Ubuntu without meddling with
this stuff (the solution is a repo config file).

> If there's no guessing involved, then explicitly specifying the known
> encoding is an inconvenience, indeed.  But is it really that big an
> inconvenience?
>
>
It's perfectly fine. Perhaps you guys are used to more os-encoding-abiding
applications and value that quality. That kind of consistency would indeed
have saved me from at least some heartache. I just wish we could get rid of
these problems for good, and promoting utf-8 everywhere is one way to go
about it.


Yuval