[Python-Dev] Unicode literals in Python 2.7

Adam Bartoš drekin at gmail.com
Thu Apr 30 17:44:00 CEST 2015


> does this not work for you?
>
> from __future__ import unicode_literals

No, with unicode_literals I just don't have to use the u'' prefix, but the
wrong interpretation persists.


On Thu, Apr 30, 2015 at 3:03 AM, Stephen J. Turnbull <stephen at xemacs.org>
wrote:

>
> IIRC, on the Linux console and in an uxterm, PYTHONIOENCODING=utf-8 in
> the environment does what you want.


Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std*
streams are created with the utf-8 encoding, which doesn't help on Windows
since they still don't use ReadConsoleW and WriteConsoleW to communicate
with the terminal. Even after replacing the sys.std* streams with the fixed
ones and setting the readline hook, it still doesn't work, so presumably
PyCF_SOURCE_IS_UTF8 is still not set.



> Regarding your environment, the repeated use of "custom" is a red
> flag.  Unless you bundle your whole environment with the code you
> distribute, Python can know nothing about that.  In general, Python
> doesn't know what encoding it is receiving text in.
>

Well, the received text comes from sys.stdin and its encoding is known.
Ideally, Python would receive the text as a Unicode string object so there
would be no problem with encoding (see
http://bugs.python.org/issue17620#msg234439 ).


> If you *do* know, you can set PyCF_SOURCE_IS_UTF8.  So if you know
> that all of your users will have your custom stdio and readline hooks
> installed (AFAICS, they can't use IDLE or IPython!), then you can
> bundle Python built with the flag set, or perhaps you can do the
> decoding in your custom stdio module.
>

The custom stdio streams and readline hooks are set at runtime by code in
sitecustomize. It does not affect IDLE and it is compatible with IPython. I
would also like to set PyCF_SOURCE_IS_UTF8 at runtime from Python, e.g. via
ctypes, but this may be impossible.



> Note that even if you have a UTF-8 input source, some users are likely
> to be surprised because IIRC Python doesn't canonicalize in its
> codecs; that is left for higher-level libraries.  Linux UTF-8 is
> usually NFC normalized, while Mac UTF-8 is NFD normalized.
>

Actually, I have a UTF-16-LE source, but that is not important since it's
decoded to a Python Unicode string object. I have this Unicode string and I
am supposed to return it from the readline hook, but I don't know how to
communicate it to the caller (the tokenizer) so that it is interpreted
correctly. Note that the following works:

>>> eval(raw_input('~~> '))
~~> u'α'
u'\u03b1'

Unfortunately, the REPL works differently from eval/exec on raw_input. It
seems that the only option is to replace the built-in REPL with a custom one
(e.g. based on code.InteractiveConsole). However, wrapping the execution of
a script so that the custom REPL is invoked at the right place is
complicated.
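
A rough sketch of what such a custom REPL could look like, assuming a
hypothetical read_text callable that returns already-decoded Unicode lines
(or None at end of input). UnicodeConsole and read_text are names made up
for this example; only code.InteractiveConsole and its raw_input, push and
interact methods come from the standard library:

```python
import code


class UnicodeConsole(code.InteractiveConsole):
    """Console whose input lines come from a callable returning decoded text."""

    def __init__(self, read_text, locals=None):
        # Explicit base-class call instead of super(): the base is a
        # classic class on Python 2.7.
        code.InteractiveConsole.__init__(self, locals)
        # Hypothetical callable: takes the prompt, returns a Unicode line
        # without the trailing newline, or None when input is exhausted.
        self._read_text = read_text

    def raw_input(self, prompt=""):
        line = self._read_text(prompt)
        if line is None:
            raise EOFError
        # The Unicode string goes to push() and then to the compiler
        # as text, so no byte-level guessing is involved.
        return line
```

Calling UnicodeConsole(read_text).interact() would then start the loop. This
only replaces the interactive prompt; it does nothing for the case where a
script runs first and the REPL has to start afterwards.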


>  > >>> On 29 Apr 2015 at 10:36, "Adam Bartoš" <drekin at gmail.com> wrote:
>  > >>> > Why am I talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
>  > >>> > u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
>
> Just to be clear, you accept those results as correct, right?
>

Yes. In the latter case, eval has no idea how the bytes given are encoded.
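
To illustrate where u'\xce\xb1' comes from: its two code points are exactly
the UTF-8 bytes of α read one byte per character, which is what a Latin-1
decoder does. A minimal sketch (the variable names are only for
illustration):

```python
alpha = u'\u03b1'                    # GREEK SMALL LETTER ALPHA
utf8_bytes = alpha.encode('utf-8')   # two bytes: 0xCE 0xB1

# Decoding with the codec that produced the bytes round-trips.
decoded_ok = utf8_bytes.decode('utf-8')

# Mapping each byte to one code point (Latin-1) gives the mangled string.
decoded_wrong = utf8_bytes.decode('latin-1')
```

Here decoded_ok equals u'\u03b1', while decoded_wrong equals u'\xce\xb1'.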