Hello, is it possible to somehow tell Python 2.7 to compile code entered in the interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering adding support for Python 2 in my package ( https://github.com/Drekin/win-unicode-console ) and I have run into the fact that when u"α" is entered in the interactive session, it results in u"\xce\xb1" rather than u"\u03b1". As this seems to be a highly specialized question, I'm asking it here. Regards, Drekin
does this not work for you?
from __future__ import unicode_literals
On 4/28/2015 16:20, Adam Bartoš wrote:
Hello,
is it possible to somehow tell Python 2.7 to compile a code entered in the interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering adding support for Python 2 in my package (https://github.com/Drekin/win-unicode-console) and I have run into the fact that when u"α" is entered in the interactive session, it results in u"\xce\xb1" rather than u"\u03b1". As this seems to be a highly specialized question, I'm asking it here.
Regards, Drekin
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/tritium-list%40sdamon.com
On 29 April 2015 at 06:20, Adam Bartoš <drekin@gmail.com> wrote:
Hello,
is it possible to somehow tell Python 2.7 to compile a code entered in the interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering adding support for Python 2 in my package (https://github.com/Drekin/win-unicode-console) and I have run into the fact that when u"α" is entered in the interactive session, it results in u"\xce\xb1" rather than u"\u03b1". As this seems to be a highly specialized question, I'm asking it here.
As far as I am aware, we don't have the equivalent of a "coding cookie" for the interactive interpreter, so if anyone else knows how to do it, I'll be learning something too :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
This situation is a bit different from coding cookies. They are used when we have bytes from a source file but don't know its encoding. During an interactive session the tokenizer always knows the encoding of the bytes. I would think that in the case of an interactive session PyCF_SOURCE_IS_UTF8 should always be set, so the bytes containing encoded non-ASCII characters are interpreted correctly. Why am I talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'. I understand that in the second case eval has no idea how the given bytes are encoded. But the first case is actually implemented by encoding to UTF-8 and setting PyCF_SOURCE_IS_UTF8. That's why I'm talking about the flag. Regards, Drekin On Wed, Apr 29, 2015 at 9:25 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 29 April 2015 at 06:20, Adam Bartoš <drekin@gmail.com> wrote:
Hello,
is it possible to somehow tell Python 2.7 to compile code entered in the interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering adding support for Python 2 in my package (https://github.com/Drekin/win-unicode-console) and I have run into the fact that when u"α" is entered in the interactive session, it results in u"\xce\xb1" rather than u"\u03b1". As this seems to be a highly specialized question, I'm asking it here.
As far as I am aware, we don't have the equivalent of a "coding cookie" for the interactive interpreter, so if anyone else knows how to do it, I'll be learning something too :)
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 29 Apr 2015 10:36, "Adam Bartoš" <drekin@gmail.com> wrote:
Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
There is a simple option to get this flag: call eval() with unicode, not with encoded bytes. Victor
Yes, that works for eval. But I want it for code entered during an interactive session.
>>> u'α'
u'\xce\xb1'
The tokenizer gets b"u'\xce\xb1'" by calling PyOS_Readline and it knows it's UTF-8 encoded. But the result of evaluation is u'\xce\xb1'. Because of how eval works, I believe that it would work correctly if PyCF_SOURCE_IS_UTF8 were set, but it is not. That is why I'm asking whether there is a way to set it. Also, my naive thought is that it should always be set in the case of an interactive session. On Wed, Apr 29, 2015 at 4:59 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
On 29 Apr 2015 10:36, "Adam Bartoš" <drekin@gmail.com> wrote:
Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
There is a simple option to get this flag: call eval() with unicode, not with encoded bytes.
Victor
I suspect the interactive session is *not* always in UTF8. It probably depends on the keyboard mapping of your terminal emulator. I imagine in Windows it's the current code page. On Wed, Apr 29, 2015 at 9:19 AM, Adam Bartoš <drekin@gmail.com> wrote:
Yes, that works for eval. But I want it for code entered during an interactive session.
>>> u'α'
u'\xce\xb1'
The tokenizer gets b"u'\xce\xb1'" by calling PyOS_Readline and it knows it's utf-8 encoded. But the result of evaluation is u'\xce\xb1'. Because of how eval works, I believe that it would work correctly if the PyCF_SOURCE_IS_UTF8 was set, but it is not. That is why I'm asking if there is a way to set it. Also, my naive thought is that it should be always set in the case of interactive session.
On Wed, Apr 29, 2015 at 4:59 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
On 29 Apr 2015 10:36, "Adam Bartoš" <drekin@gmail.com> wrote:
Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
There is a simple option to get this flag: call eval() with unicode, not with encoded bytes.
Victor
-- --Guido van Rossum (python.org/~guido)
On Wed, Apr 29, 2015 at 09:40:43AM -0700, Guido van Rossum <guido@python.org> wrote:
I suspect the interactive session is *not* always in UTF8. It probably depends on the keyboard mapping of your terminal emulator. I imagine in Windows it's the current code page.
Even worse: in w32 it can be an OEM codepage. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
I am on Windows and my terminal isn't UTF-8 at the beginning, but I install custom sys.std* objects at runtime and I also install a custom readline hook, so the interactive loop gets the input from my stream objects via PyOS_Readline. So when I enter u'α', the tokenizer gets b"u'\xce\xb1'", which is the string encoded in UTF-8, and sys.stdin.encoding == 'utf-8'. However, the input is then interpreted as u'\xce\xb1' instead of u'\u03b1'. On Wed, Apr 29, 2015 at 6:40 PM, Guido van Rossum <guido@python.org> wrote:
I suspect the interactive session is *not* always in UTF8. It probably depends on the keyboard mapping of your terminal emulator. I imagine in Windows it's the current code page.
On Wed, Apr 29, 2015 at 9:19 AM, Adam Bartoš <drekin@gmail.com> wrote:
Yes, that works for eval. But I want it for code entered during an interactive session.
>>> u'α'
u'\xce\xb1'
The tokenizer gets b"u'\xce\xb1'" by calling PyOS_Readline and it knows it's utf-8 encoded. But the result of evaluation is u'\xce\xb1'. Because of how eval works, I believe that it would work correctly if the PyCF_SOURCE_IS_UTF8 was set, but it is not. That is why I'm asking if there is a way to set it. Also, my naive thought is that it should be always set in the case of interactive session.
On Wed, Apr 29, 2015 at 4:59 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
On 29 Apr 2015 10:36, "Adam Bartoš" <drekin@gmail.com> wrote:
Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
There is a simple option to get this flag: call eval() with unicode, not with encoded bytes.
Victor
-- --Guido van Rossum (python.org/~guido)
Adam Bartoš writes:
I am in Windows and my terminal isn't utf-8 at the beginning, but I install custom sys.std* objects at runtime and I also install custom readline hook,
IIRC, on the Linux console and in an uxterm, PYTHONIOENCODING=utf-8 in the environment does what you want. (Can't test at the moment, I'm on a Mac and Terminal.app somehow fails to pass the right thing to Python from the input methods I have available -- I get an empty string, while I don't seem to have an uxterm, only an xterm.) This has to be set at interpreter startup; once the interpreter has decided its IO encoding, you can't change it, you can only override it by intercepting the console input and decoding it yourself.

Regarding your environment, the repeated use of "custom" is a red flag. Unless you bundle your whole environment with the code you distribute, Python can know nothing about that. In general, Python doesn't know what encoding it is receiving text in. If you *do* know, you can set PyCF_SOURCE_IS_UTF8. So if you know that all of your users will have your custom stdio and readline hooks installed (AFAICS, they can't use IDLE or IPython!), then you can bundle Python built with the flag set, or perhaps you can do the decoding in your custom stdio module.

Note that even if you have a UTF-8 input source, some users are likely to be surprised because IIRC Python doesn't canonicalize in its codecs; that is left for higher-level libraries. Linux UTF-8 is usually NFC normalized, while Mac UTF-8 is NFD normalized.
u'\xce\xb1'
Note that that is perfectly legal Unicode.
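For concreteness, the normalization difference is easy to demonstrate with the stdlib unicodedata module (a sketch, using U+00E9 as the example character):

```python
import unicodedata

# Precomposed form (NFC): a single code point, U+00E9
nfc = "\u00e9"
# Decomposed form (NFD): 'e' followed by U+0301 COMBINING ACUTE ACCENT
nfd = unicodedata.normalize("NFD", nfc)

print(len(nfc), len(nfd))  # 1 2
print(nfd == "e\u0301")    # True
# Round-trips back to the precomposed form
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

Both strings display identically in most terminals, which is exactly why the mismatch surprises users.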
On 29 Apr 2015 10:36, "Adam Bartoš" <drekin@gmail.com> wrote:
Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
Just to be clear, you accept those results as correct, right?
On Thu, Apr 30, 2015 at 11:03 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Note that even if you have a UTF-8 input source, some users are likely to be surprised because IIRC Python doesn't canonicalize in its codecs; that is left for higher-level libraries. Linux UTF-8 is usually NFC normalized, while Mac UTF-8 is NFD normalized.
u'\xce\xb1'
Note that that is perfectly legal Unicode.
It's legal Unicode, but it doesn't mean what he typed in. This means:

'\xce' LATIN CAPITAL LETTER I WITH CIRCUMFLEX
'\xb1' PLUS-MINUS SIGN

but the original input was:

'\u03b1' GREEK SMALL LETTER ALPHA

ChrisA
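That mojibake can be reproduced directly; the sketch below is Python 3, but the same relation holds between Python 2's str and unicode:

```python
import unicodedata

raw = b"\xce\xb1"  # UTF-8 encoding of GREEK SMALL LETTER ALPHA

# Decoded with the correct codec, the alpha comes back:
print(raw.decode("utf-8"))    # α

# Interpreted byte-for-byte (Latin-1), the same bytes become two characters:
mojibake = raw.decode("latin-1")
for ch in mojibake:
    print(hex(ord(ch)), unicodedata.name(ch))
# 0xce LATIN CAPITAL LETTER I WITH CIRCUMFLEX
# 0xb1 PLUS-MINUS SIGN
```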
does this not work for you?
from __future__ import unicode_literals
No, with unicode_literals I just don't have to use the u'' prefix, but the wrong interpretation persists. On Thu, Apr 30, 2015 at 3:03 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
IIRC, on the Linux console and in an uxterm, PYTHONIOENCODING=utf-8 in the environment does what you want.
Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std* streams are created with utf-8 encoding (which doesn't help on Windows since they still don't use ReadConsoleW and WriteConsoleW to communicate with the terminal) and after changing the sys.std* streams to the fixed ones and setting readline hook, it still doesn't work, so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
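What PYTHONIOENCODING does reach is the sys.std* stream encodings, which can be checked by spawning a child interpreter (a Python 3 sketch; subprocess.run with these arguments needs 3.7+):

```python
import os
import subprocess
import sys

# Spawn a child interpreter with PYTHONIOENCODING set and ask it what
# encoding its stdout stream ended up with.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # utf-8
```

This confirms the variable configures the streams; whether anything propagates to the tokenizer is the separate question here.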
Regarding your environment, the repeated use of "custom" is a red flag. Unless you bundle your whole environment with the code you distribute, Python can know nothing about that. In general, Python doesn't know what encoding it is receiving text in.
Well, the received text comes from sys.stdin and its encoding is known. Ideally, Python would receive the text as a Unicode string object, so there would be no problem with encoding (see http://bugs.python.org/issue17620#msg234439 ). If you *do* know, you can set PyCF_SOURCE_IS_UTF8. So if you know
that all of your users will have your custom stdio and readline hooks installed (AFAICS, they can't use IDLE or IPython!), then you can bundle Python built with the flag set, or perhaps you can do the decoding in your custom stdio module.
The custom stdio streams and readline hooks are set at runtime by a code in sitecustomize. It does not affect IDLE and it is compatible with IPython. I would like to also set PyCF_SOURCE_IS_UTF8 at runtime from Python e.g. via ctypes. But this may be impossible.
Note that even if you have a UTF-8 input source, some users are likely to be surprised because IIRC Python doesn't canonicalize in its codecs; that is left for higher-level libraries. Linux UTF-8 is usually NFC normalized, while Mac UTF-8 is NFD normalized.
Actually, I have a UTF-16-LE source, but that is not important since it's decoded to a Python Unicode string object. I have this Unicode string and I'm to return it from the readline hook, but I don't know how to communicate it to the caller – the tokenizer – so it is interpreted correctly. Note that the following works:
>>> eval(raw_input('~~> '))
~~> u'α'
u'\u03b1'
Unfortunately, the REPL works differently than eval/exec on raw_input. It seems that the only option is to bypass the REPL with a custom REPL (e.g. based on code.InteractiveConsole). However, wrapping up the execution of a script, so that the custom REPL is invoked at the right place, is complicated.
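A custom REPL along those lines can be sketched with the stdlib code module; the point is that input is handled as an already-decoded string end to end, bypassing the tokenizer's byte-oriented readline path. The ReadConsoleW part is elided here, input() stands in for it:

```python
import code

class UnicodeConsole(code.InteractiveConsole):
    """Minimal REPL replacement that reads input as already-decoded text."""

    def raw_input(self, prompt=""):
        # A real implementation would read via ReadConsoleW (or the
        # custom sys.stdin); the built-in input() stands in for that.
        return input(prompt)

# UnicodeConsole().interact()  # would start the interactive loop
```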
On 29 Apr 2015 10:36, "Adam Bartoš" <drekin@gmail.com> wrote:
Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
Just to be clear, you accept those results as correct, right?
Yes. In the latter case, eval has no idea how the bytes given are encoded.
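For contrast, Python 3 resolves this by defining bytes source to be UTF-8 unless a coding cookie says otherwise, so there both spellings agree. A quick check:

```python
# The source string: three characters, quote + α + quote
src = "'\u03b1'"

# str source needs no decoding at all
print(eval(src))                  # α

# bytes source is decoded as UTF-8 by Python 3 before parsing,
# so the result is the same (unlike Python 2)
print(eval(src.encode("utf-8")))  # α
```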
Adam Bartoš writes:
Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std* streams are created with utf-8 encoding (which doesn't help on Windows since they still don't use ReadConsoleW and WriteConsoleW to communicate with the terminal) and after changing the sys.std* streams to the fixed ones and setting readline hook, it still doesn't work,
I don't see why you would expect it to work: either your code is bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't matter, or you're feeding already decoded text *as UTF-8* to your module which evidently expects something else (UTF-16LE?).
so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
I don't think that flag does what you think it does. AFAICT from looking at the source, that flag gets unconditionally set in the execution context for compile, eval, and exec, and it is checked in the parser when creating an AST node. So it looks to me like it asserts that the *internal* representation of the program is UTF-8 *after* transforming the input to an internal representation (doing charset decoding, removing comments and line continuations, etc).
Regarding your environment, the repeated use of "custom" is a red flag. Unless you bundle your whole environment with the code you distribute, Python can know nothing about that. In general, Python doesn't know what encoding it is receiving text in.
Well, the received text comes from sys.stdin and its encoding is known.
How? You keep asserting this. *You* know, but how are you passing that information to *the Python interpreter*? Guido may have a time machine, but nobody claims the Python interpreter is telepathic.
Ideally, Python would recieve the text as Unicode String object so there would be no problem with encoding
Forget "ideal". Python 3 was created (among other reasons) to get closer to that ideal. But programs in Python 2 are received as str, which is bytes in an ASCII-compatible encoding, not unicode (unless otherwise specified by PYTHONIOENCODING or a coding cookie in a source file; as far as I know, those are the only ways to specify the source encoding). This specification of "Python program" isn't going to change in Python 2; that's one of the major unfixable reasons that Python 2 and Python 3 will be incompatible forever.
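The coding-cookie route can be seen with compile() on bytes source; this sketch uses Python 3 syntax, but the PEP 263 cookie mechanism is the same one Python 2 uses for source files:

```python
# The byte 0xe9 is only meaningful once the cookie names the encoding:
# with the latin-1 cookie it is decoded to 'é' before tokenization.
src = b"# -*- coding: latin-1 -*-\nx = '\xe9'\n"

ns = {}
exec(compile(src, "<string>", "exec"), ns)
print(ns["x"])  # é
```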
The custom stdio streams and readline hooks are set at runtime by a code in sitecustomize. It does not affect IDLE and it is compatible with IPython. I would like to also set PyCF_SOURCE_IS_UTF8 at runtime from Python e.g. via ctypes. But this may be impossible.
Yes. In the latter case, eval has no idea how the bytes given are encoded.
Eval *never* knows how bytes are encoded, not even implicitly. That's one of the important reasons why Python 3 was necessary. I think you know that, but you don't write like you understand the implications for your current work, which makes it hard to communicate.
On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Adam Bartoš writes:
Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std* streams are created with utf-8 encoding (which doesn't help on Windows since they still don't use ReadConsoleW and WriteConsoleW to communicate with the terminal) and after changing the sys.std* streams to the fixed ones and setting readline hook, it still doesn't work,
I don't see why you would expect it to work: either your code is bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't matter, or you're feeding already decoded text *as UTF-8* to your module which evidently expects something else (UTF-16LE?).
I'll describe my picture of the situation, which might be terribly wrong.

On Linux, in a typical situation, we have a UTF-8 terminal, PYTHONIOENCODING=utf-8, and GNU readline is used. When the REPL wants input from a user, the tokenizer calls PyOS_Readline, which calls GNU readline. The user is prompted >>> , during the input he can use autocompletion and everything, and he enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as char* or something), which is the UTF-8 encoded input from the user. The tokenizer, parser, and evaluator process the input and the result is u'\u03b1', which is printed as an answer.

In my case I install custom sys.std* objects and a custom readline hook. Again, the tokenizer calls PyOS_Readline, which calls my readline hook, which calls sys.stdin.readline(), which returns the Unicode string the user entered (it was decoded from UTF-16-LE bytes actually). My readline hook encodes this string to UTF-8 and returns it. So the situation is the same. The tokenizer gets b"u'\xce\xb1'" as before, but now it results in u'\xce\xb1'.

Why is the result different? I thought that in the first case PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I thought that PYTHONIOENCODING=utf-8 is the thing that also sets PyCF_SOURCE_IS_UTF8.
so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
I don't think that flag does what you think it does. AFAICT from looking at the source, that flag gets unconditionally set in the execution context for compile, eval, and exec, and it is checked in the parser when creating an AST node. So it looks to me like it asserts that the *internal* representation of the program is UTF-8 *after* transforming the input to an internal representation (doing charset decoding, removing comments and line continuations, etc).
I thought it might do what I want because of the behaviour of eval. I thought that the PyUnicode_AsUTF8String call in eval just encodes the passed unicode to UTF-8, so the situation looks as follows:

eval(u"u'\u03b1'") -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 set) -> u'\u03b1'
eval(u"u'\u03b1'".encode('utf-8')) -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 not set) -> u'\xce\xb1'

But of course, this picture of mine might be wrong.
Well, the received text comes from sys.stdin and its encoding is known.
How? You keep asserting this. *You* know, but how are you passing that information to *the Python interpreter*? Guido may have a time machine, but nobody claims the Python interpreter is telepathic.
I thought that the Python interpreter knows the input comes from sys.stdin at least to some extent because in pythonrun.c:PyRun_InteractiveOneObject the encoding for the tokenizer is inferred from sys.stdin.encoding. But this is actually the case only in Python 3. So I was wrong.
Yes. In the latter case, eval has no idea how the bytes given are encoded.
Eval *never* knows how bytes are encoded, not even implicitly. That's one of the important reasons why Python 3 was necessary. I think you know that, but you don't write like you understand the implications for your current work, which makes it hard to communicate.
Yes, eval never knows how bytes are encoded. But I meant it in comparison with the first case where a Unicode string was passed.
Adam Bartoš writes:
I'll describe my picture of the situation, which might be terribly wrong. On Linux, in a typical situation, we have a UTF-8 terminal, PYTHONIOENCODING=utf-8, GNU readline is used. When the REPL wants input from a user the tokenizer calls PyOS_Readline, which calls GNU readline. The user is prompted >>> , during the input he can use autocompletion and everything and he enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as char* or something),
It's char*, according to Parser/myreadline.c. It is not str in Python 2.
which is UTF-8 encoded input from the user.
By default, it's just ASCII-compatible bytes. I don't know offhand where, but somehow PYTHONIOENCODING tells Python it's UTF-8 -- that's how Python knows about it in this situation.
The tokenizer, parser, and evaluator process the input and the result is u'\u03b1', which is printed as an answer.
In my case I install custom sys.std* objects and a custom readline hook. Again, the tokenizer calls PyOS_Readline, which calls my readline hook, which calls sys.stdin.readline(),
This is your custom version?
which returns the Unicode string the user entered (it was decoded from UTF-16-LE bytes actually). My readline hook encodes this string to UTF-8 and returns it. So the situation is the same. The tokenizer gets b"u'\xce\xb1'" as before, but now it results in u'\xce\xb1'.
Why is the result different?
The result is different because Python doesn't "learn" that the actual encoding is UTF-8. If you have tried setting PYTHONIOENCODING=utf-8 with your setup and that doesn't work, I'm not sure where the communication is failing. The only other thing I can think of is to set sys.stdin.encoding. That may be read-only, though (that would explain why the only way to set PYTHONIOENCODING is via an environment variable). At least you could find out what it is, with and without PYTHONIOENCODING set to 'utf-8' (or maybe it's 'utf8' or 'UTF-8' -- all work as expected with unicode.encode/str.decode on Mac OS X). Or it could be unimplemented in your replacement module.
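For what it's worth, on Python 3 the encoding travels with the stream object itself, which is how a replacement stdin can advertise it; a sketch over an in-memory buffer:

```python
import io

# A stand-in for a byte-oriented stdin carrying UTF-8 data
raw = io.BytesIO("u'\u03b1'\n".encode("utf-8"))
stream = io.TextIOWrapper(raw, encoding="utf-8")

print(stream.encoding)            # utf-8
print(stream.readline(), end="")  # u'α'
```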
I though that in the first case PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I thought that PYTHONIOENCODING=utf-8 is the thing that also sets PyCF_SOURCE_IS_UTF8.
No. PyCF_SOURCE_IS_UTF8 is set unconditionally in the functions builtin_{eval,exec,compile}_impl in Python/bltinmodule.c in the cases that matter AFAICS. It's not obvious to me under what conditions it might *not* be set. It is then consulted in ast.c in PyAST_FromNodeObject, and nowhere else that I can see.
I think I have found out where the problem is. In fact, the encoding of the interactive input is determined by sys.stdin.encoding, but only in the case that it is a file object (see https://hg.python.org/cpython/file/d356e68de236/Parser/tokenizer.c#l890 and the implementation of tok_stdin_decode). For example, by default on my system sys.stdin has encoding cp852.
>>> u'á'
u'\xe1'  # correct
>>> import sys; sys.stdin = "foo"
>>> u'á'
u'\xa0'  # incorrect
Even if sys.stdin contained a file-like object with a proper encoding attribute, it wouldn't work, since sys.stdin has to be an instance of <type 'file'>. So the question is whether it is possible to make a file instance in Python that is also customizable, so it may call my code. For a start: how can one change the value of the encoding attribute of a file object?
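A pure-Python file-like replacement with a writable encoding attribute is trivial; the obstacle is exactly the type check, since an object like this is not an instance of <type 'file'>. A sketch (names are illustrative, not the package's API):

```python
class EncodedStdin(object):
    """File-like stdin replacement that advertises its own encoding.

    Being an ordinary class (not the built-in file type), Python 2's
    tokenizer would ignore its encoding attribute entirely.
    """
    encoding = "utf-8"  # freely writable, unlike file.encoding

    def __init__(self, lines):
        self._lines = iter(lines)

    def readline(self):
        try:
            return next(self._lines)
        except StopIteration:
            return ""  # EOF convention for readline()

stdin = EncodedStdin(["u'\u03b1'\n"])
print(stdin.encoding)            # utf-8
print(stdin.readline(), end="")  # u'α'
```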
On 02.05.15 at 21:57, Adam Bartoš wrote:
Even if sys.stdin contained a file-like object with proper encoding attribute, it wouldn't work since sys.stdin has to be instance of <type 'file'>. So the question is, whether it is possible to make a file instance in Python that is also customizable so it may call my code. For the first thing, how to change the value of encoding attribute of a file object.
If, by "in Python", you mean both "in pure Python" and "in Python 2", then the answer is no. If you can add arbitrary C code, then you might be able to hack your C library's stdio implementation to delegate fread calls to your code. I recommend using Python 3 instead. Regards, Martin
I already have a solution in Python 3 (see https://github.com/Drekin/win-unicode-console, https://pypi.python.org/pypi/win_unicode_console); I was just considering adding support for Python 2 as well. I think I have a working example in Python 2 using ctypes. On Thu, May 7, 2015 at 9:23 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
On 02.05.15 at 21:57, Adam Bartoš wrote:
Even if sys.stdin contained a file-like object with proper encoding attribute, it wouldn't work since sys.stdin has to be instance of <type 'file'>. So the question is, whether it is possible to make a file instance in Python that is also customizable so it may call my code. For the first thing, how to change the value of encoding attribute of a file object.
If, by "in Python", you mean both "in pure Python", and "in Python 2", then the answer is no. If you can add arbitrary C code, then you might be able to hack your C library's stdio implementation to delegate fread calls to your code.
I recommend to use Python 3 instead.
Regards, Martin
On 5/9/2015 5:39 AM, Adam Bartoš wrote:
I already have a solution in Python 3 (see https://github.com/Drekin/win-unicode-console, https://pypi.python.org/pypi/win_unicode_console), I was just considering adding support for Python 2 as well. I think I have an working example in Python 2 using ctypes.
Is this going to get released in 3.5, I hope? Python 3 is pretty limited without some solution for Unicode on the console... probably the biggest deficiency I have found in Python 3, since its introduction. It has great Unicode support for files and processing, which convinced me to switch from Perl, and I like so much else about it, that I can hardly code in Perl any more (I still support a few Perl programs, but have ported most of them to Python). I wondered if all your recent questions about Py 2 were as a result of porting the above to Py 2... I only have one program left that I was forced to write in Py 2 because of library dependencies, and I think that library is finally being ported to Py 3, whew! So while I laud your efforts, and no doubt they will benefit some folks for a few years yet, I hope never to use your Py 2 port myself!
Glenn Linderman wrote:
Is this going to get released in 3.5, I hope? Python 3 is pretty limited without some solution for Unicode on the console... probably the biggest deficiency I have found in Python 3, since its introduction. It has great Unicode support for files and processing, which convinced me to switch from Perl, and I like so much else about it, that I can hardly code in Perl any more (I still support a few Perl programs, but have ported most of them to Python).
I'd love to see it included in 3.5, but I doubt that will happen. For one thing, it's only two weeks till beta 1, which is feature freeze. And mainly, my package is mostly hacking into the existing Python environment. A proper implementation would need some changes in Python that someone would have to do. See for example my proposal http://bugs.python.org/issue17620#msg234439. I'm not competent to write a patch myself and I have also had no feedback on the proposed idea. On the other hand, using the package is good enough for me, so I didn't bring further attention to the proposal.
On 10 May 2015 at 23:28, Adam Bartoš <drekin@gmail.com> wrote:
Glenn Linderman wrote:
Is this going to get released in 3.5, I hope? Python 3 is pretty limited without some solution for Unicode on the console... probably the biggest deficiency I have found in Python 3, since its introduction. It has great Unicode support for files and processing, which convinced me to switch from Perl, and I like so much else about it, that I can hardly code in Perl any more (I still support a few Perl programs, but have ported most of them to Python).
I'd love to see it included in 3.5, but I doubt that will happen. For one thing, it's only two weeks till beta 1, which is feature freeze. And mainly, my package is mostly hacking into existing Python environment. A proper implementation would need some changes in Python someone would have to do. See for example my proposal http://bugs.python.org/issue17620#msg234439. I'm not competent to write a patch myself and I have also no feedback to the proposed idea. On the other hand, using the package is good enough for me so I didn't further bring attention to the proposal.
Right, and while I'm interested in seeing this improved, I'm not especially familiar with the internal details of our terminal interaction implementation, and even less so when it comes to the Windows terminal. Steve Dower's also had his hands full working on the Windows installer changes, and several of our other Windows folks aren't C programmers. PEP 432 (the interpreter startup sequence improvements) will be back on the agenda for Python 3.6, so the 3.6 time frame seems more plausible at this point. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 5/11/2015 1:09 AM, Nick Coghlan wrote:
On 10 May 2015 at 23:28, Adam Bartoš <drekin@gmail.com> wrote:
Glenn Linderman wrote:
Is this going to get released in 3.5, I hope? Python 3 is pretty limited without some solution for Unicode on the console... probably the biggest deficiency I have found in Python 3, since its introduction. It has great Unicode support for files and processing, which convinced me to switch from Perl, and I like so much else about it, that I can hardly code in Perl any more (I still support a few Perl programs, but have ported most of them to Python).
I'd love to see it included in 3.5, but I doubt that will happen. For one thing, it's only two weeks till beta 1, which is feature freeze. And mainly, my package is mostly hacking into the existing Python environment. A proper implementation would need some changes in Python someone would have to do. See for example my proposal http://bugs.python.org/issue17620#msg234439. I'm not competent to write a patch myself and I have also had no feedback on the proposed idea. On the other hand, using the package is good enough for me so I didn't further bring attention to the proposal.
Right, and while I'm interested in seeing this improved, I'm not especially familiar with the internal details of our terminal interaction implementation, and even less so when it comes to the Windows terminal. Steve Dower's also had his hands full working on the Windows installer changes, and several of our other Windows folks aren't C programmers.
PEP 432 (the interpreter startup sequence improvements) will be back on the agenda for Python 3.6, so the 3.6 time frame seems more plausible at this point.
Cheers, Nick.
Wow! Another bug that'll reach a decade in age before being fixed...
On 12 May 2015 at 06:38, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 5/11/2015 1:09 AM, Nick Coghlan wrote:
On 10 May 2015 at 23:28, Adam Bartoš <drekin@gmail.com> wrote:
I'd love to see it included in 3.5, but I doubt that will happen. For one thing, it's only two weeks till beta 1, which is feature freeze. And mainly, my package is mostly hacking into the existing Python environment. A proper implementation would need some changes in Python someone would have to do. See for example my proposal http://bugs.python.org/issue17620#msg234439. I'm not competent to write a patch myself and I have also had no feedback on the proposed idea. On the other hand, using the package is good enough for me so I didn't further bring attention to the proposal.
Right, and while I'm interested in seeing this improved, I'm not especially familiar with the internal details of our terminal interaction implementation, and even less so when it comes to the Windows terminal. Steve Dower's also had his hands full working on the Windows installer changes, and several of our other Windows folks aren't C programmers.
PEP 432 (the interpreter startup sequence improvements) will be back on the agenda for Python 3.6, so the 3.6 time frame seems more plausible at this point.
Cheers, Nick.
Wow! Another bug that'll reach a decade in age before being fixed...
Yep, that tends to happen with complex cross-platform bugs & RFEs that require domain expertise in multiple areas to resolve. It's one of the areas that operating system vendors are typically best equipped to handle, but we haven't historically had that kind of major institutional backing for CPython core development (that *is* changing, but it's a relatively recent phenomenon). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (10)
- "Martin v. Löwis"
- Adam Bartoš
- Alexander Walters
- Chris Angelico
- Glenn Linderman
- Guido van Rossum
- Nick Coghlan
- Oleg Broytman
- Stephen J. Turnbull
- Victor Stinner