On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Adam Bartoš writes:

 > Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
 > sys.std* streams are created with utf-8 encoding (which doesn't
 > help on Windows since they still don't use ReadConsoleW and
 > WriteConsoleW to communicate with the terminal) and after changing
 > the sys.std* streams to the fixed ones and setting readline hook,
 > it still doesn't work,

I don't see why you would expect it to work: either your code is
bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
matter, or you're feeding already decoded text *as UTF-8* to your
module which evidently expects something else (UTF-16LE?).

I'll describe my picture of the situation, which might be terribly wrong. On Linux, in a typical situation, we have a UTF-8 terminal, PYTHONENIOENCODING=utf-8, GNU readline is used. When the REPL wants input from a user the tokenizer calls PyOS_Readline, which calls GNU readline. The user is prompted >>> , during the input he can use autocompletion and everything and he enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as char* or something), which is UTF-8 encoded input from the user. The tokenizer, parser, and evaluator process the input and the result is u'\u03b1', which is printed as an answer.

In my case I install custom sys.std* objects and a custom readline hook. Again, the tokenizer calls PyOS_Readline, which calls my readline hook, which calls sys.stdin.readline(), which returns an Unicode string a user entered (it was decoded from UTF-16-LE bytes actually). My readline hook encodes this string to UTF-8 and returns it. So the situation is the same. The tokenizer gets b"\u'xce\xb1'" as before, but know it results in u'\xce\xb1'.

Why is the result different? I though that in the first case PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I thought that PYTHONIOENCODING=utf-8 is the thing that also sets PyCF_SOURCE_IS_UTF8.

 > so presumably the PyCF_SOURCE_IS_UTF8 is still not set.

I don't think that flag does what you think it does.  AFAICT from
looking at the source, that flag gets unconditionally set in the
execution context for compile, eval, and exec, and it is checked in
the parser when creating an AST node.  So it looks to me like it
asserts that the *internal* representation of the program is UTF-8
*after* transforming the input to an internal representation (doing
charset decoding, removing comments and line continuations, etc).

I thought it might do what I want because of the behaviour of eval. I thought that the PyUnicode_AsUTF8String call in eval just encodes the passed unicode to UTF-8, so the situation looks like follows:
eval(u"u'\u031b'") -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 set) -> u'\u03b1'
eval(u"u'\u031b'".encode('utf-8')) -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 not set) -> u'\xce\xb1'
But of course, this my picture might be wrong.

 > Well, the received text comes from sys.stdin and its encoding is
 > known.

How?  You keep asserting this.  *You* know, but how are you passing
that information to *the Python interpreter*?  Guido may have a time
machine, but nobody claims the Python interpreter is telepathic.

I thought that the Python interpreter knows the input comes from sys.stdin at least to some extent because in pythonrun.c:PyRun_InteractiveOneObject the encoding for the tokenizer is inferred from sys.stdin.encoding. But this is actually the case only in Python 3. So I was wrong.

 > Yes. In the latter case, eval has no idea how the bytes given are
 > encoded.

Eval *never* knows how bytes are encoded, not even implicitly.  That's
one of the important reasons why Python 3 was necessary.  I think you
know that, but you don't write like you understand the implications
for your current work, which makes it hard to communicate.

Yes, eval never knows how bytes are encoded. But I meant it in comparison with the first case where a Unicode string was passed.