[Python-ideas] PEP 540: Add a new UTF-8 mode

Wed Jan 11 05:46:24 EST 2017

Hi INADA Naoki,

(Sorry, I am unsure if INADA or Naoki is your first name...)

While I am very much in favour of everything working "out of the box",
an issue is that we don't have control over external code
(be it Python extensions or external processes invoked from Python).

And that code will only look at LANG/LC_CTYPE and ignore any cleverness
we build into Python.

For example, this may mean that a built-in Python string sort will give you
a different ordering than invoking the external "sort" command.
I have been bitten by this kind of issue, which leads to spurious "diffs" if
you try to use sorting to put strings into a canonical order.
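
To illustrate the mismatch, here is a minimal sketch. The built-in sort
compares code points, while locale-aware tools collate according to
LC_COLLATE. The helper name sorted_like_locale and the word list are made up
for this example, and whether en_US.UTF-8 is installed depends on the
system, so the helper falls back to code point order when it is not.

```python
import locale

words = ["banana", "Apple", "cherry", "apple", "Banana"]

# The built-in sort compares Unicode code points, so every uppercase
# ASCII letter sorts before every lowercase one, whatever the locale is.
by_code_point = sorted(words)
print(by_code_point)


def sorted_like_locale(items, locale_name):
    """Sort with the collation rules of locale_name (hypothetical helper).

    Falls back to plain code point order if the locale is not installed.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(items)
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


# The "C" locale collates by byte value, matching the built-in sort for ASCII.
print(sorted_like_locale(words, "C"))

# A language locale may interleave upper and lower case instead
# (the result depends on the locale data installed on the system).
print(sorted_like_locale(words, "en_US.UTF-8"))
```

If the last two lines print different orderings, that difference is exactly
what shows up as a spurious "diff" between Python and an external "sort".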

So my feeling is that people are ultimately not being helped by
Python trying to be "nice", since they will be bitten by locale issues
anyway. IMHO it is ultimately better to educate them to configure the locale.
(I realise that people may reasonably disagree with this assessment ;-) )

I would then recommend setting the locale to en_US.UTF-8, which is slower and
less elegant than C.UTF-8 but at least more widely supported.

By the way, I know a bit about how Node.js deals with locales, and it doesn't
try to compensate for "C" locales either. What it *does* do is never use
the locale settings to determine the encoding of a file:
you either have to specify it explicitly OR it defaults to UTF-8 (the
latter on output only).
So in this respect it is by specification immune to misconfiguration
of the encoding.
However, other stuff (e.g. date formatting) will still be influenced by the
"C" locale as usual.
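
For comparison, the same discipline is already available in Python today by
always passing the encoding explicitly. This is only a sketch of that habit,
not a description of what PEP 540 proposes; the temporary file and the sample
text are made up for the example.

```python
import locale
import os
import tempfile

text = "naïve café ☕"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Without an explicit encoding, open() falls back to a locale-dependent
# default; this call reports what that default is on the current system.
print("locale default:", locale.getpreferredencoding(False))

# With an explicit encoding, LANG / LC_ALL / LC_CTYPE no longer matter
# for how the file contents are encoded and decoded.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# The bytes on disk are UTF-8 regardless of the locale.
with open(path, "rb") as f:
    raw = f.read()

print(round_trip == text, raw == text.encode("utf-8"))
```

As with Node, this only covers file encodings; anything else that consults
the locale is unaffected.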

Stephan

2017-01-11 9:17 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:

> Here is one example of locale pitfall.
>
> ---
> # from http://unix.stackexchange.com/questions/169739/why-is-coreutils-sort-slower-than-python
>
> $ cat letters.py
> import string
> import random
>
> def main():
>     for _ in range(1_000_000):
>         c = random.choice(string.ascii_letters)
>         print(c)
>
> main()
>
> $ python3 letters.py > letters.txt
>
> $ LC_ALL=C time sort letters.txt > /dev/null
>         0.35 real         0.32 user         0.02 sys
>
> $ LC_ALL=C.UTF-8 time sort letters.txt > /dev/null
>         0.36 real         0.33 user         0.02 sys
>
> $ LC_ALL=ja_JP.UTF-8 time sort letters.txt > /dev/null
>        11.03 real        10.95 user         0.04 sys
>
> $ LC_ALL=en_US.UTF-8 time sort letters.txt > /dev/null
>        11.05 real        10.97 user         0.04 sys
> ---
>
> This is why some engineers, including me, use the C locale on Linux,
> at least when there is no C.UTF-8 locale.
>
> Of course, we can use LC_CTYPE=en_US.UTF-8, instead of LANG or LC_ALL.
> (I wonder if we can use LC_CTYPE=UTF-8...)
>
> But I dislike the current situation that "people should learn
> how to configure the locale properly, and the pitfalls of non-C locales,
> only for using UTF-8 on Python".