> does this not work for you?
>
> from __future__ import unicode_literals

No, with unicode_literals I just don't have to use the u'' prefix, but the wrong interpretation persists.


On Thu, Apr 30, 2015 at 3:03 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:

IIRC, on the Linux console and in an uxterm, PYTHONIOENCODING=utf-8 in
the environment does what you want. 

Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std* streams are created with UTF-8 encoding, which doesn't help on Windows since they still don't use ReadConsoleW and WriteConsoleW to communicate with the terminal. And even after replacing the sys.std* streams with the fixed ones and setting the readline hook, it still doesn't work, so presumably PyCF_SOURCE_IS_UTF8 is still not set.
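For reference, the effect of PYTHONIOENCODING on the stream encodings can be confirmed in a subprocess (the variable is read only once, at interpreter startup, so it must be set before Python launches):

```python
import os
import subprocess
import sys

# Launch a fresh interpreter with PYTHONIOENCODING set and report what
# encoding its sys.stdout was created with.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
out = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
)
print(out.decode().strip())  # the stream encoding follows PYTHONIOENCODING
```

This only shows that the *streams* pick up the encoding; as noted above, it says nothing about how the tokenizer interprets the bytes it gets from the readline hook.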

 
Regarding your environment, the repeated use of "custom" is a red
flag.  Unless you bundle your whole environment with the code you
distribute, Python can know nothing about that.  In general, Python
doesn't know what encoding it is receiving text in.

Well, the received text comes from sys.stdin and its encoding is known. Ideally, Python would receive the text as a unicode string object, so there would be no encoding problem at all (see http://bugs.python.org/issue17620#msg234439 ).
 

If you *do* know, you can set PyCF_SOURCE_IS_UTF8.  So if you know
that all of your users will have your custom stdio and readline hooks
installed (AFAICS, they can't use IDLE or IPython!), then you can
bundle Python built with the flag set, or perhaps you can do the
decoding in your custom stdio module.

The custom stdio streams and readline hooks are set at runtime by code in sitecustomize. This does not affect IDLE, and it is compatible with IPython. I would also like to set PyCF_SOURCE_IS_UTF8 at runtime from Python, e.g. via ctypes, but that may be impossible.
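The stream-replacement part of such a sitecustomize is roughly the following (a sketch in Python 3 API terms with io.TextIOWrapper; `rewrap` is a hypothetical helper, and a real Windows version would additionally route output through WriteConsoleW):

```python
import io
import sys

def rewrap(stream, encoding="utf-8"):
    """Return a text wrapper with an explicit encoding over the stream's
    raw buffer, or the stream unchanged if it has no buffer (e.g. it was
    already replaced by a custom object)."""
    buf = getattr(stream, "buffer", None)
    if buf is None:
        return stream
    return io.TextIOWrapper(buf, encoding=encoding, line_buffering=True)
```

In sitecustomize one would then do `sys.stdout = rewrap(sys.stdout)` and likewise for stdin and stderr.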

 
Note that even if you have a UTF-8 input source, some users are likely
to be surprised because IIRC Python doesn't canonicalize in its
codecs; that is left for higher-level libraries.  Linux UTF-8 is
usually NFC normalized, while Mac UTF-8 is NFD normalized.

Actually, I have a UTF-16-LE source, but that is not important since it is decoded into a Python unicode string object. I have this unicode string and I want to return it from the readline hook, but I don't know how to communicate it to the caller (the tokenizer) so that it is interpreted correctly. Note that the following works:

>>> eval(raw_input('~~> '))
~~> u'α'
u'\u03b1'

Unfortunately, the REPL works differently from eval/exec on raw_input. It seems that the only option is to bypass the built-in REPL with a custom one (e.g. based on code.InteractiveConsole). However, wrapping up the execution of a script so that the custom REPL is invoked at the right place is complicated.
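A minimal sketch of such a replacement REPL (the encoding is an assumption; the point is that InteractiveConsole compiles unicode strings rather than bytes, so the tokenizer never sees raw UTF-8 and the PyCF_SOURCE_IS_UTF8 question goes away):

```python
import code
import sys

class UnicodeConsole(code.InteractiveConsole):
    encoding = "utf-8"  # assumed console encoding

    def raw_input(self, prompt=""):
        sys.stdout.write(prompt)
        sys.stdout.flush()
        line = sys.stdin.readline()
        if not line:
            raise EOFError
        if isinstance(line, bytes):  # Python 2: decode before compiling
            line = line.decode(self.encoding)
        return line.rstrip("\n")

# UnicodeConsole().interact() would start the custom REPL
```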


 > >>> On 29 Apr 2015 10:36, "Adam Bartoš" <drekin@gmail.com> wrote:
 > >>> > Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
 > >>> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.

Just to be clear, you accept those results as correct, right?

Yes. In the latter case, eval has no idea how the given bytes are encoded.
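The second result can be reproduced explicitly; this sketch shows what the Python 2 tokenizer effectively does with undeclared source bytes when PyCF_SOURCE_IS_UTF8 is not set (written in Python 3-compatible syntax):

```python
# UTF-8 encoding of the source text u'α'
src = u"u'\u03b1'".encode("utf-8")   # b"u'\xce\xb1'"
# Without PyCF_SOURCE_IS_UTF8, each byte of the unicode literal is
# treated as a Latin-1 character; reproduce that mis-decoding:
wrong = src.decode("latin-1")
result = eval(wrong)  # the two UTF-8 bytes of α survive as two characters
```

Here `result` is the two-character string u'\xce\xb1', matching the surprising output above.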