[Python-ideas] PEP 540: Add a new UTF-8 mode

Wed Jan 11 06:22:57 EST 2017

On 01/11/2017 11:46 AM, Stephan Houben wrote:
> Hi INADA Naoki,
>
> (Sorry, I am unsure if INADA or Naoki is your first name...)
>
> While I am very much in favour of everything working "out of the box",
> an issue is that we don't have control over external code
> (be it Python extensions or external processes invoked from Python).
>
> And that code will only look at LANG/LC_CTYPE and ignore any cleverness
> we build into Python.
>
> For example, this may mean that a built-in Python string sort will give you
> a different ordering than invoking the external "sort" command.
> I have been bitten by this kind of issue, leading to spurious "diffs" if
> you try to use sorting to put strings into a canonical order.

AFAIK, this would not be a problem under PEP 538, which effectively
treats the "C" locale as "C.UTF-8". Strings of Unicode codepoints and
the corresponding UTF-8-encoded bytes sort the same way, since UTF-8
preserves codepoint order under byte-wise comparison.
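
A quick way to check that claim is sketched below. The word list is made
up for illustration; it only assumes that Python compares str objects by
codepoint and bytes objects byte by byte.

```python
# Sample strings (hypothetical data): ASCII, accented Latin, CJK,
# and one codepoint outside the Basic Multilingual Plane.
words = ["zebra", "Zebra", "éclair", "apple", "日本語", "\U0001F600", "z"]

# Sort the strings directly: Python orders str by codepoint.
by_codepoint = sorted(words)

# Encode to UTF-8, sort the raw bytes, then decode again.
by_utf8_bytes = [
    b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)
]

# Both orderings should agree.
print(by_codepoint == by_utf8_bytes)
```

This says nothing about locale-aware collation such as en_US.UTF-8; it
only covers plain codepoint versus byte order.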

Is that wrong, or do you have a better example of trouble with using 
"C.UTF-8" instead of "C"?

> So my feeling is that people are ultimately not being helped by
> Python trying to be "nice", since they will be bitten by locale issues
> anyway. IMHO ultimately better to educate them to configure the locale.
> (I realise that people may reasonably disagree with this assessment ;-) )
>
> I would then recommend setting the locale to en_US.UTF-8, which is
> slower and less elegant but at least more widely supported.

What about the spurious diffs you'd get when switching from "C" to 
"en_US.UTF-8"?

$ LC_ALL=en_US.UTF-8 sort file.txt
a
a
A
A
$ LC_ALL=C sort file.txt
A
A
a
a
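
The same effect can be reproduced from Python. This is only a sketch: the
helper names sort_codepoint and sort_with_locale are made up here, and
whether "en_US.UTF-8" exists (and how exactly it collates) depends on the
system, so the helper returns None when the locale cannot be set.

```python
import locale


def sort_codepoint(lines):
    # Plain Python sort: str objects compare by codepoint.
    return sorted(lines)


def sort_with_locale(lines, locale_name):
    # Sort using the collation rules of the named locale.
    # Returns None if the locale is not available on this system.
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(lines, key=locale.strxfrm)
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)


lines = ["a", "A", "a", "A"]
print(sort_codepoint(lines))
print(sort_with_locale(lines, "C"))
print(sort_with_locale(lines, "en_US.UTF-8"))  # system-dependent, may be None
```

The first two results should match each other; the third is the one that
can differ, which is the source of the spurious diffs.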

> By the way, I know a bit how Node.js deals with locales, and it doesn't try
> to compensate for "C" locales either. But what it *does* do is that
> Node never uses the locale settings to determine the encoding of a file:
> you either have to specify it explicitly OR it defaults to UTF-8 (the
> latter on output only).
> So in this respect it is by specification immune against
> misconfiguration of the encoding.
> However, other stuff (e.g. date formatting) will still be influenced by
> the "C" locale as usual.

I believe the main problem is that the "C" locale really means two very 
different things:

a) Text is encoded as 7-bit ASCII; higher codepoints are an error
b) No encoding was specified

In both cases, treating "C" as "C.UTF-8" is not bad:
a) For 7-bit "text", there's no real difference between these locales
b) UTF-8 is a much better default than ASCII
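
To illustrate both cases, here is a small sketch using Python's built-in
codecs. The sample strings are made up; nothing here depends on the
actual locale settings.

```python
# Case a): pure 7-bit ASCII data decodes identically under both codecs.
ascii_bytes = "plain text".encode("ascii")
same_result = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Case b): data that is really UTF-8 is rejected by the ASCII codec
# but decodes fine as UTF-8.
utf8_bytes = "naïve café".encode("utf-8")
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
decoded = utf8_bytes.decode("utf-8")

print(same_result, ascii_ok, decoded)
```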