<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull <span dir="ltr"><<a href="mailto:stephen@xemacs.org" target="_blank">stephen@xemacs.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>Adam Bartoš writes:<br>

<br>

 > Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the<br>

 > sys.std* streams are created with utf-8 encoding (which doesn't<br>

 > help on Windows since they still don't use ReadConsoleW and<br>

 > WriteConsoleW to communicate with the terminal) and after changing<br>

 > the sys.std* streams to the fixed ones and setting readline hook,<br>

 > it still doesn't work,<br>

<br>

</span>I don't see why you would expect it to work: either your code is<br>

bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't<br>

matter, or you're feeding already decoded text *as UTF-8* to your<br>

module which evidently expects something else (UTF-16LE?).<span><br></span></blockquote><div><br></div><div>I'll describe my picture of the situation, which might be terribly wrong. On Linux, in a typical situation, we have a UTF-8 terminal, PYTHONIOENCODING=utf-8, and GNU readline is used. When the REPL wants input from the user, the tokenizer calls PyOS_Readline, which calls GNU readline. The user is prompted with >>> , can use autocompletion and everything during input, and enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as char* or something), which is the UTF-8 encoded input from the user. The tokenizer, parser, and evaluator process the input and the result is u'\u03b1', which is printed as the answer.<br><br></div><div>In my case I install custom sys.std* objects and a custom readline hook. Again, the tokenizer calls PyOS_Readline, which calls my readline hook, which calls sys.stdin.readline(), which returns the Unicode string the user entered (it was actually decoded from UTF-16-LE bytes). My readline hook encodes this string to UTF-8 and returns it. So the situation is the same. The tokenizer gets b"u'\xce\xb1'" as before, but now it results in u'\xce\xb1'.<br><br></div><div>Why is the result different? I thought that in the first case PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I thought that PYTHONIOENCODING=utf-8 is the thing that also sets PyCF_SOURCE_IS_UTF8.<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>

 > so presumably the PyCF_SOURCE_IS_UTF8 is still not set.<br>

<br>

</span>I don't think that flag does what you think it does.  AFAICT from<br>

looking at the source, that flag gets unconditionally set in the<br>

execution context for compile, eval, and exec, and it is checked in<br>

the parser when creating an AST node.  So it looks to me like it<br>

asserts that the *internal* representation of the program is UTF-8<br>

*after* transforming the input to an internal representation (doing<br>

charset decoding, removing comments and line continuations, etc).<span><br></span></blockquote><div><br></div><div>I thought it might do what I want because of the behaviour of eval. I thought that the PyUnicode_AsUTF8String call in eval just encodes the passed unicode to UTF-8, so the situation is as follows:<br></div><div>eval(u"u'\u03b1'") -> (b"u'\xce\xb1'", <span>PyCF_SOURCE_IS_UTF8 set</span>) -> u'\u03b1'<br></div><div>eval(u"u'\u03b1'".encode('utf-8')) -> (b"u'\xce\xb1'", <span>PyCF_SOURCE_IS_UTF8 not set) -> u'\xce\xb1'<br></span></div><div><span>But of course, this picture of mine might be wrong.<br></span></div><div><br><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>

 > Well, the received text comes from sys.stdin and its encoding is<br>

 > known.<br>

<br>

</span>How?  You keep asserting this.  *You* know, but how are you passing<br>

that information to *the Python interpreter*?  Guido may have a time<br>

machine, but nobody claims the Python interpreter is telepathic.<span><br></span></blockquote><div><br></div><div>I thought that the Python interpreter knows the input comes from sys.stdin at least to some extent because in pythonrun.c:PyRun_InteractiveOneObject the encoding for the tokenizer is inferred from sys.stdin.encoding. But this is actually the case only in Python 3. So I was wrong. <br></div><div><br><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>

</span><span> > Yes. In the latter case, eval has no idea how the bytes given are<br>

 > encoded.<br>

<br>

</span>Eval *never* knows how bytes are encoded, not even implicitly.  That's<br>

one of the important reasons why Python 3 was necessary.  I think you<br>

know that, but you don't write like you understand the implications<br>

for your current work, which makes it hard to communicate.<br></blockquote><div><br></div><div>Yes, eval never knows how bytes are encoded. But I meant it in comparison with the first case where a Unicode string was passed.<br></div></div><br></div></div>
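<div dir="ltr">To make the comparison between the two eval cases concrete, here is a minimal Python 3 sketch that only simulates the two outcomes described above. It does not call into the Python 2 tokenizer. The helper names eval_as_utf8 and eval_as_latin1 are hypothetical, and modelling the "flag not set" case as a byte-for-byte Latin-1 decode is an assumption chosen because it reproduces the observed u'\xce\xb1' result.</div>

```python
import ast

# The text the user typed at the prompt: u'α' (α is U+03B1).
typed = "u'\u03b1'"

# What a readline hook would hand to the tokenizer: the UTF-8 bytes.
utf8_source = typed.encode("utf-8")
assert utf8_source == b"u'\xce\xb1'"


def eval_as_utf8(source_bytes):
    """Hypothetical helper: simulates the case where the source is known
    to be UTF-8 (the 'PyCF_SOURCE_IS_UTF8 set' picture)."""
    return ast.literal_eval(source_bytes.decode("utf-8"))


def eval_as_latin1(source_bytes):
    """Hypothetical helper: simulates the case where the encoding is
    unknown and every byte becomes the code point of the same value
    (an assumed Latin-1 fallback)."""
    return ast.literal_eval(source_bytes.decode("latin-1"))


with_flag = eval_as_utf8(utf8_source)       # one character: U+03B1
without_flag = eval_as_latin1(utf8_source)  # two characters: U+00CE, U+00B1

print(ascii(with_flag))
print(ascii(without_flag))
```

<div dir="ltr">The same bytes give a one-character string in the first case and a two-character string in the second, which is the difference between u'\u03b1' and u'\xce\xb1' discussed in this thread.</div>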