[Python-ideas] PEP 540: Add a new UTF-8 mode

Stephan Houben stephanh42 at gmail.com
Thu Jan 12 07:36:35 EST 2017


Hi Petr,

2017-01-11 12:22 GMT+01:00 Petr Viktorin <encukou at gmail.com>:
>
>> For example, this may mean that a built-in Python string sort will give you
>> a different ordering than invoking the external "sort" command.
>> I have been bitten by this kind of issues, leading to spurious "diffs" if
>> you try to use sorting to put strings into a canonical order.
>>
>
> AFAIK, this would not be a problem under PEP 538, which effectively treats
> the "C" locale as "C.UTF-8". Strings of Unicode codepoints and the
> corresponding UTF-8-encoded bytes sort the same way.
>

...and this is also something new I learned.
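
A quick way to check that claim is to sort the same strings twice, once as
str (Python compares by code point) and once keyed on their UTF-8 bytes.
This is only a sketch; the word list is made up for illustration.

```python
# Hypothetical sample data: ASCII, Latin-1 range, CJK, and a non-BMP character.
words = ["zebra", "Äpfel", "apple", "éclair", "Zoo", "日本語", "\U0001F600"]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Compare the UTF-8 encoded bytes instead.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
print(same_order)
```

The two orders agree because UTF-8 was designed so that byte-wise comparison
preserves code point order.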


>
> Is that wrong, or do you have a better example of trouble with using
> "C.UTF-8" instead of "C"?



After long deliberation, it seems I cannot find any source of trouble. +1

>> So my feeling is that people are ultimately not being helped by
>> Python trying to be "nice", since they will be bitten by locale issues
>> anyway. IMHO ultimately better to educate them to configure the locale.
>> (I realise that people may reasonably disagree with this assessment ;-) )
>>
>> I would then recommend to set to en_US.UTF-8, which is slower and
>> less elegant but at least more widely supported.
>>
>
> What about the spurious diffs you'd get when switching from "C" to
> "en_US.UTF-8"?
>

That taught me to explicitly invoke "sort" using
LANG=en_US.UTF-8 sort
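
The same idea from Python might look like the sketch below. The helper name
external_sort is made up, and it assumes a POSIX "sort" on PATH. It pins
LC_ALL rather than LANG (my substitution), since LC_ALL takes precedence
over LANG and LC_COLLATE if those happen to be set in the environment.

```python
import os
import subprocess


def external_sort(lines, locale_name="en_US.UTF-8"):
    """Hypothetical helper: sort lines with the external "sort" command
    under an explicitly pinned locale."""
    env = dict(os.environ)
    # Assumption: pin LC_ALL so no other locale variable can override it.
    env["LC_ALL"] = locale_name
    result = subprocess.run(
        ["sort"],
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        encoding="utf-8",
        env=env,
        check=True,
    )
    return result.stdout.splitlines()
```

Whether en_US.UTF-8 is actually installed depends on the system, so the
locale name is a parameter rather than hard-coded.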


>
> I believe the main problem is that the "C" locale really means two very
> different things:
>
> a) Text is encoded as 7-bit ASCII; higher codepoints are an error
> b) No encoding was specified
>
> In both cases, treating "C" as "C.UTF-8" is not bad:
> a) For 7-bit "text", there's no real difference between these locales
> b) UTF-8 is a much better default than ASCII
>
>
A "C" locale also means that a program should not *output* non-ASCII
characters unless they were explicitly fed in (as with "cat", "sort", or
the "ls" equivalent from PEP 540).

So, for example, a program might want to print fancy Unicode box characters
to show progress, and check sys.stderr.encoding to see whether it can do so.
However, under a "C" locale it should not do so, since the terminal is
unlikely to display the fancy box characters properly.

Note that under the current PEP 540 proposal, sys.stderr would use the
UTF-8/backslashreplace encoding under the "C" locale, so checking
sys.stderr.encoding alone would no longer tell these cases apart.

I think this may ultimately be a minor concern, but it would be nice if we
had some API to at least reliably answer the question "can I safely output
non-ASCII myself?"
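
To make the question concrete, here is a rough sketch of what such a check
might look like. Both function names are hypothetical (no such API exists
in the standard library), and the logic is a heuristic: it cannot tell
whether the terminal really renders the characters, nor whether a locale
variable was set by the user or by some locale coercion.

```python
def configured_ctype_locale(environ):
    """Hypothetical helper: the LC_CTYPE locale requested via the environment."""
    # POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; empty means unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            return value
    return "C"


def can_output_non_ascii(encoding, locale_name, sample="\u2588"):
    """Hypothetical check: is it reasonable to emit `sample` unprompted?"""
    # Under the "C"/"POSIX" locale, stay with ASCII whatever the stream says.
    if locale_name in ("C", "POSIX"):
        return False
    if not encoding:
        return False
    try:
        sample.encode(encoding)  # strict: fails if the codec cannot represent it
    except (UnicodeEncodeError, LookupError):
        return False
    return True
```

A caller could then write something like
can_output_non_ascii(sys.stderr.encoding, configured_ctype_locale(os.environ))
before deciding between box characters and plain ASCII.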

Stephan