[Python-ideas] PEP 540: Add a new UTF-8 mode

Oleg Broytman phd at phdru.name
Fri Jan 6 14:12:16 EST 2017


On Fri, Jan 06, 2017 at 10:15:52AM +0900, INADA Naoki <songofacandy at gmail.com> wrote:
> >> Always use UTF-8
> >> ----------------
> >>
> >> Python already always uses the UTF-8 encoding on Mac OS X, Android and Windows.
> >> Since UTF-8 became the de facto encoding, it makes sense to always use it on all
> >> platforms with any locale.
> >
> >    Please don't! I use different locales and encodings, sometimes it's
> > utf-8, sometimes not - but I have properly configured LC_* settings and
> > I prefer Python to follow my command. It'd be disgusting if Python
> > started to bend me to its preferences.
> 
> For stdio (including the console), PYTHONIOENCODING can be used to
> support legacy systems,
> e.g. `export PYTHONIOENCODING=$(locale charmap)`

   This means one more thing to reconfigure when I switch locales,
instead of Python catching up automatically.
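
   A minimal sketch of what "following the locale" looks like from
inside Python, using only the standard library. The helper name
locale_encoding() is hypothetical, made up for this illustration. Note
that when the UTF-8 mode is enabled, locale.getpreferredencoding()
reports UTF-8 regardless of the LC_* settings.

```python
import codecs
import locale


def locale_encoding():
    """Return the normalized codec name implied by the LC_* settings.

    Hypothetical helper for illustration only.
    """
    try:
        # Apply the user's environment (LC_ALL, LC_CTYPE, LANG).
        # Harmless if the interpreter has already done so.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep the current locale rather than failing.
        pass
    name = locale.getpreferredencoding(False)
    # codecs.lookup() maps aliases such as "UTF-8" or "KOI8-R" to one name.
    return codecs.lookup(name).name


if __name__ == "__main__":
    print(locale_encoding())
```

   Run under two different LC_ALL values, this should print two
different codec names without any Python-specific variable being set.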

> For command-line arguments and file paths, UTF-8/surrogateescape can round-trip.
> But mojibake may happen when the path is passed to a GUI.
> 
> If we choose "Always use UTF-8 for fs encoding", I think the
> PYTHONFSENCODING envvar should be
> added again.  (It should be used from startup, when decoding command-line arguments.)
> 
> >
> >> The risk is to introduce mojibake if the locale uses a different encoding,
> >> especially for locales other than the POSIX locale.
> >
> >    There is no such risk for me as I already have mojibake in my
> > systems. The two most notable sources of mojibake are:
> >
> > 1) FTP servers - people create files (both names and content) in
> >    different encodings; w32 FTP clients usually send file names and
> >    content in cp1251 (Russian Windows encoding), sometimes in cp866
> >    (Russian Windows OEM encoding).
> >
> > 2) MP3 tags and play lists - almost always cp1251.
> >
> >    So whatever my personal encoding is - koi8-r or utf-8 - I have to
> > deal with file names and content in different encodings.
> 
> 3) Unzipping zip files sent from Windows.  Windows users use non-ASCII filenames and
> create legacy (non-UTF-8) zip files very often.

   Good example, thank you! I forgot about it because I wrote my
own zip.py and unzip.py that encode/decode filenames.
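
   A rough sketch of the decoding half, not the actual unzip.py. It
relies on the fact that Python's zipfile module decodes a member name
as cp437 when the UTF-8 flag (bit 11 of the general-purpose flags) is
not set, so the original bytes can be recovered and decoded again. The
function name fix_legacy_name() and the cp866 default are assumptions
chosen for the example.

```python
UTF8_FLAG = 0x800  # bit 11 of the general-purpose flags: name is UTF-8


def fix_legacy_name(name, flag_bits, legacy_encoding="cp866"):
    """Re-decode a zip member name that zipfile decoded as cp437.

    `name` and `flag_bits` correspond to ZipInfo.filename and
    ZipInfo.flag_bits. `legacy_encoding` is a guess about the sender's
    system; cp866 is only an assumed default.
    """
    if flag_bits & UTF8_FLAG:
        # The archive declares the name as UTF-8; nothing to repair.
        return name
    try:
        # Undo the cp437 decoding to get the raw bytes back, then
        # decode them with the encoding the sender probably used.
        return name.encode("cp437").decode(legacy_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not recoverable with this guess; leave the name untouched.
        return name


if __name__ == "__main__":
    raw = "привет.txt".encode("cp866")   # bytes as stored in the archive
    garbled = raw.decode("cp437")        # what zipfile would report
    print(fix_legacy_name(garbled, 0))
```

   In real use one would loop over ZipFile.infolist() and apply this to
each entry; the right legacy encoding still has to be guessed or
configured per archive.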

> I think people using non-UTF-8 encodings should solve encoding issues by themselves.
> People should always use ASCII or UTF-8 if they don't want to see mojibake.

   Impossible. Even if I always used UTF-8, I would still receive a lot
of cp1251/cp866.
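
   To make the two effects discussed in this thread concrete, here is a
small sketch. It shows that cp1251 bytes survive a
UTF-8/surrogateescape round trip unchanged, and that decoding the same
bytes with the wrong codec gives mojibake. The function names and the
sample file name are made up for the illustration.

```python
def roundtrip(raw):
    """Decode bytes as UTF-8/surrogateescape and encode them back."""
    # Bytes that are not valid UTF-8 become lone surrogates
    # (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", "surrogateescape")
    # Encoding with the same handler restores the original bytes.
    return text.encode("utf-8", "surrogateescape")


def misread(raw, wrong_encoding="koi8-r"):
    """Decode bytes with a codec other than the one that produced them."""
    return raw.decode(wrong_encoding)


if __name__ == "__main__":
    name = "Привет.mp3"            # sample name, assumed for the example
    raw = name.encode("cp1251")    # what a cp1251 client would send
    print(roundtrip(raw) == raw)   # the bytes themselves are preserved
    print(misread(raw))            # but this is not the original name
```

   The round trip keeps the data intact, but it does nothing to tell
you which encoding the bytes were in.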

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            phd at phdru.name
           Programmers don't die, they just GOSUB without RETURN.

