[Python-ideas] RFC: PEP 540 version 3 (Add a new UTF-8 mode)
Victor Stinner
victor.stinner at gmail.com
Thu Jan 12 12:10:35 EST 2017
2017-01-12 17:10 GMT+01:00 Oleg Broytman <phd at phdru.name>:
>> Does it work to use a locale with encoding A for LC_CTYPE and a locale
>> with encoding B for LC_MESSAGES (and others)? Is there a risk of
>
> It does when B is a subset of A (ascii and koi8; ascii and utf8, e.g.)
My question is more about the case where encodings A and B are not
compatible.
Ah yes, date, thank you for the example. Here is my example using the
LC_TIME locale to format a date and the LC_CTYPE locale to decode the
resulting byte string:
date.py:
---
import locale, time
locale.setlocale(locale.LC_ALL, "")
b = time.strftime("%a")
encoding = locale.getpreferredencoding()
try:
    u = b.decode(encoding)
except UnicodeError:
    u = '<failed to decode>'
else:
    u = repr(u)
print("bytes: %r, text: %s, encoding: %r" % (b, u, encoding))
---
When all locales are the same, it works fine; 목 (U+baa9) is the expected result:
$ LC_TIME=ko_KR.euckr LANG=ko_KR.euckr python2 date.py
bytes: '\xb8\xf1', text: u'\ubaa9', encoding: 'EUC-KR'
You get mojibake if LC_CTYPE uses the Latin1 encoding whereas LC_TIME
uses the EUC-KR encoding: the result is "¸ñ" (U+00b8, U+00f1) instead of 목.
$ LC_TIME=ko_KR.euckr LANG=fr_FR python2 date.py
bytes: '\xb8\xf1', text: u'\xb8\xf1', encoding: 'ISO-8859-1'
The program can also fail with UnicodeDecodeError, here because the
EUC-KR bytes are not valid UTF-8:
$ LC_TIME=ko_KR.euckr LANG=fr_FR.UTF-8 python2 date.py
bytes: '\xb8\xf1', text: <failed to decode>, encoding: 'UTF-8'
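The three outcomes above can be reproduced without installing any
locale, by decoding the same two bytes with each encoding directly.
This is only a minimal sketch, written for Python 3 rather than the
python2 used above; RAW and try_decode are hypothetical names chosen
for illustration, not part of date.py.

```python
# Illustrative sketch (Python 3); RAW and try_decode are hypothetical names.
RAW = b"\xb8\xf1"  # the bytes date.py printed under LC_TIME=ko_KR.euckr


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Same bytes, three different LC_CTYPE encodings.
for encoding in ("euc_kr", "iso-8859-1", "utf-8"):
    text = try_decode(RAW, encoding)
    shown = ascii(text) if text is not None else "<failed to decode>"
    print("encoding: %s, text: %s" % (encoding, shown))
```

Latin1 never fails because every byte maps to a code point, which is
why that case gives silent mojibake rather than an exception.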
Well, since we are talking about the POSIX locale, which usually uses
ASCII, it shouldn't be an issue in practice for PEP 538. I was
just curious :-)
Victor