
Thinking more about entering Japanese into raw_input() in IDLE, I thought I figured out a way to give Takeuchi a Unicode string when he enters Japanese characters. I added an experimental patch to the readline method of the PyShell class: if the line just read, when converted to Unicode, has fewer characters but still compares equal (and no exceptions happen during this test), then return the Unicode version.

This doesn't currently work because the built-in raw_input() function requires that the readline() call it makes internally return an 8-bit string. Should I relax that requirement in general? (I could also just replace __builtin__.[raw_]input with more liberal versions supplied by IDLE.)

I also discovered that the built-in unicode() function is not idempotent: unicode(unicode('a')) returns u'\000a'. I think it should special-case this and return u'a'!

Finally, I believe we need a way to discover the encoding used by stdin or stdout. I have to admit I know very little about the file wrappers that Marc wrote -- is it easy to get the encoding out of them? IDLE should probably emulate this, as its encoding is clearly UTF-8 (at least when using Tcl 8.1 or newer).

--Guido van Rossum (home page: http://www.python.org/~guido/)
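(A hypothetical sketch of that readline heuristic in today's Python, where the decode step is explicit; the function name `maybe_unicode` and the choice of UTF-8 are illustrative assumptions, not the actual IDLE patch, and the "still compares equal" check is simplified away:)

```python
def maybe_unicode(line_bytes):
    # Hypothetical sketch, not the actual patch: decode the raw line as
    # UTF-8; if the decoded text has fewer characters than the byte
    # string had bytes (i.e. it contained multi-byte sequences), prefer
    # the decoded (Unicode) version.
    try:
        text = line_bytes.decode('utf-8')
    except UnicodeDecodeError:
        return line_bytes          # not valid UTF-8: keep the raw bytes
    if len(text) < len(line_bytes):
        return text                # multi-byte input: return Unicode
    return line_bytes              # pure ASCII: keep the 8-bit string

assert maybe_unicode('äöü\n'.encode('utf-8')) == 'äöü\n'   # Unicode wins
assert maybe_unicode(b'abc\n') == b'abc\n'                 # ASCII stays raw
```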

Guido van Rossum wrote:
Thinking more about entering Japanese into raw_input() in IDLE, I thought I figured out a way to give Takeuchi a Unicode string when he enters Japanese characters.
I added an experimental patch to the readline method of the PyShell class: if the line just read, when converted to Unicode, has fewer characters but still compares equal (and no exceptions happen during this test) then return the Unicode version.
This doesn't currently work because the built-in raw_input() function requires that the readline() call it makes internally return an 8-bit string. Should I relax that requirement in general? (I could also just replace __builtin__.[raw_]input with more liberal versions supplied by IDLE.)
I also discovered that the built-in unicode() function is not idempotent: unicode(unicode('a')) returns u'\000a'. I think it should special-case this and return u'a'!
Good idea. I'll fix this in the next round.
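(The proposed special case amounts to passing Unicode input through unchanged. A minimal sketch in modern terms, where str plays the role of the old unicode type; the name `to_text` is hypothetical:)

```python
def to_text(obj, encoding='ascii'):
    # Sketch of the proposed special case: if the input is already
    # text, return it as-is instead of converting it a second time.
    if isinstance(obj, str):
        return obj
    return obj.decode(encoding)

# Applying the conversion twice is now a no-op:
assert to_text(to_text(b'a')) == to_text(b'a') == 'a'
```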
Finally, I believe we need a way to discover the encoding used by stdin or stdout. I have to admit I know very little about the file wrappers that Marc wrote -- is it easy to get the encoding out of them?
I'm not sure what you mean: the name of the input encoding? Currently, only the names of the encoding and decoding functions are available to be queried.
IDLE should probably emulate this, as its encoding is clearly UTF-8 (at least when using Tcl 8.1 or newer).
It should be possible to redirect sys.stdin/stdout using the codecs.EncodedFile wrapper. Some tests show that raw_input() doesn't seem to use the redirected sys.stdin though...
>>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
>>> s = raw_input()
äöü
>>> s
'\344\366\374'
>>> s = sys.stdin.read()
äöü
>>> s
'\303\244\303\266\303\274\012'
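(codecs.EncodedFile survives in modern Python, and the recoding shown in the session can be reproduced with an in-memory stream standing in for the Latin-1 terminal; the BytesIO setup is an illustrative assumption:)

```python
import codecs
import io

# Stand-in for a terminal that delivers Latin-1 bytes:
raw = io.BytesIO('äöü\n'.encode('latin-1'))

# data_encoding='utf-8', file_encoding='latin-1': bytes read from the
# underlying stream are decoded as Latin-1 and re-encoded as UTF-8.
wrapped = codecs.EncodedFile(raw, 'utf-8', 'latin-1')

data = wrapped.read()
assert data == b'\xc3\xa4\xc3\xb6\xc3\xbc\n'  # i.e. '\303\244\303\266\303\274\012'
```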
--
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Finally, I believe we need a way to discover the encoding used by stdin or stdout. I have to admit I know very little about the file wrappers that Marc wrote -- is it easy to get the encoding out of them?
I'm not sure what you mean: the name of the input encoding? Currently, only the names of the encoding and decoding functions are available to be queried.
Whatever is helpful for a module or program that wants to know what kind of encoding is used.
IDLE should probably emulate this, as its encoding is clearly UTF-8 (at least when using Tcl 8.1 or newer).
It should be possible to redirect sys.stdin/stdout using the codecs.EncodedFile wrapper. Some tests show that raw_input() doesn't seem to use the redirected sys.stdin though...
>>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
>>> s = raw_input()
äöü
>>> s
'\344\366\374'
>>> s = sys.stdin.read()
äöü
>>> s
'\303\244\303\266\303\274\012'
This deserves more looking into. The code for raw_input() in bltinmodule.c certainly *tries* to use sys.stdin. (I think that because your EncodedFile object is not a real stdio file object, it will take the second branch, near the end of the function; this calls PyFile_GetLine() which attempts to call readline().) Aha! It actually seems that your read() and readline() are inconsistent! I don't know your API well enough to know which string is "correct" (\344\366\374 or \303\244\303\266\303\274) but when I call sys.stdin.readline() I get the same as raw_input() returns:
>>> from codecs import *
>>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
>>> s = raw_input()
äöü
>>> s
'\344\366\374'
>>> s = sys.stdin.read()
äöü
>>> s
'\303\244\303\266\303\274\012'
>>> unicode(s)
u'\344\366\374\012'
>>> s = sys.stdin.readline()
äöü
>>> s
'\344\366\374\012'
Didn't you say that your wrapper only wraps read()? Maybe you need to revise that decision! (Note that PyShell doesn't even define read() -- it only defines readline().)
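(That second branch is still observable today: once sys.stdin is replaced by something that isn't a real console file, input() -- raw_input() then -- falls back to calling its readline() method. A small illustration; `FakeStdin` is a hypothetical stand-in:)

```python
import sys

class FakeStdin:
    # Hypothetical stand-in: not a real file, so input() cannot take
    # the interactive fast path and falls back to calling readline().
    def readline(self):
        return 'äöü\n'

old_stdin = sys.stdin
sys.stdin = FakeStdin()
try:
    s = input()        # goes through FakeStdin.readline()
finally:
    sys.stdin = old_stdin

assert s == 'äöü'      # input() strips the trailing newline
```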

Guido van Rossum wrote:
Finally, I believe we need a way to discover the encoding used by stdin or stdout. I have to admit I know very little about the file wrappers that Marc wrote -- is it easy to get the encoding out of them?
I'm not sure what you mean: the name of the input encoding? Currently, only the names of the encoding and decoding functions are available to be queried.
Whatever is helpful for a module or program that wants to know what kind of encoding is used.
IDLE should probably emulate this, as its encoding is clearly UTF-8 (at least when using Tcl 8.1 or newer).
It should be possible to redirect sys.stdin/stdout using the codecs.EncodedFile wrapper. Some tests show that raw_input() doesn't seem to use the redirected sys.stdin though...
>>> sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
>>> s = raw_input()
äöü
>>> s
'\344\366\374'
>>> s = sys.stdin.read()
äöü
>>> s
'\303\244\303\266\303\274\012'
The latter is the "correct" output, BTW.
This deserves more looking into. The code for raw_input() in bltinmodule.c certainly *tries* to use sys.stdin. (I think that because your EncodedFile object is not a real stdio file object, it will take the second branch, near the end of the function; this calls PyFile_GetLine() which attempts to call readline().)
Aha! It actually seems that your read() and readline() are inconsistent!
They are because I haven't yet found a way to implement readline() without buffering read-ahead data. The only way I can think of to implement it without buffering would be to read one char at a time, which is much too slow.

Buffering is hard to implement right when assuming that streams are stacked... every level would have its own buffering scheme, and mixing .read() and .readline() wouldn't work too well. Anyway, I'll give it a try...
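(The read-ahead buffering Marc describes can be sketched as follows: readline() pulls chunks via the underlying read() and keeps the surplus for the next call. A rough sketch with hypothetical names, not Marc's implementation:)

```python
import io

class LineReader:
    # Rough sketch of readline() built on an underlying read(),
    # keeping read-ahead data in a buffer between calls.
    def __init__(self, stream, chunk_size=72):
        self.stream = stream
        self.chunk_size = chunk_size
        self.buffer = ''

    def readline(self):
        # Read ahead until the buffer contains a newline or we hit EOF.
        while '\n' not in self.buffer:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:                          # EOF: flush the rest
                line, self.buffer = self.buffer, ''
                return line
            self.buffer += chunk
        line, _, self.buffer = self.buffer.partition('\n')
        return line + '\n'

r = LineReader(io.StringIO('one\ntwo\nthree'))
assert r.readline() == 'one\n'
assert r.readline() == 'two\n'
assert r.readline() == 'three'   # last partial line, no trailing newline
assert r.readline() == ''        # EOF
```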

Aha! It actually seems that your read() and readline() are inconsistent!
They are because I haven't yet found a way to implement readline() without buffering read-ahead data. The only way I can think of to implement it without buffering would be to read one char at a time which is much too slow.
Buffering is hard to implement right when assuming that streams are stacked... every level would have its own buffering scheme, and mixing .read() and .readline() wouldn't work too well. Anyway, I'll give it a try...
Since you're calling methods on the underlying file object anyway, can't you avoid buffering by calling the *corresponding* underlying method and doing the conversion on that?

Guido van Rossum wrote:
Aha! It actually seems that your read() and readline() are inconsistent!
They are because I haven't yet found a way to implement readline() without buffering read-ahead data. The only way I can think of to implement it without buffering would be to read one char at a time which is much too slow.
Buffering is hard to implement right when assuming that streams are stacked... every level would have its own buffering scheme, and mixing .read() and .readline() wouldn't work too well. Anyway, I'll give it a try...
Since you're calling methods on the underlying file object anyway, can't you avoid buffering by calling the *corresponding* underlying method and doing the conversion on that?
The problem here is that Unicode has far more line break characters than plain ASCII. The underlying API would break on ASCII lines (or even worse, on those CRLF sequences defined by the C lib), not the ones I need for Unicode.

BTW, I think that we may need a new Codec class layer here: .readline() et al. are all text based methods, while the Codec base classes clearly work on all kinds of binary and text data.
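(Modern str.splitlines() shows the difference Marc is pointing at: Unicode defines additional line boundaries, e.g. U+2028 LINE SEPARATOR and U+0085 NEL, which a plain '\n' split never sees:)

```python
text = 'a\u2028b\u0085c\nd'

# splitlines() honors the extra Unicode line boundaries...
assert text.splitlines() == ['a', 'b', 'c', 'd']

# ...while splitting on the ASCII newline alone does not.
assert text.split('\n') == ['a\u2028b\u0085c', 'd']
```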

Since you're calling methods on the underlying file object anyway, can't you avoid buffering by calling the *corresponding* underlying method and doing the conversion on that?
The problem here is that Unicode has far more line break characters than plain ASCII. The underlying API would break on ASCII lines (or even worse on those CRLF sequences defined by the C lib), not the ones I need for Unicode.
Hm, can't we just use \n for now?
BTW, I think that we may need a new Codec class layer here: .readline() et al. are all text based methods, while the Codec base classes clearly work on all kinds of binary and text data.
Not sure what you mean here. Can you explain through an example?

Guido van Rossum wrote:
Since you're calling methods on the underlying file object anyway, can't you avoid buffering by calling the *corresponding* underlying method and doing the conversion on that?
The problem here is that Unicode has far more line break characters than plain ASCII. The underlying API would break on ASCII lines (or even worse on those CRLF sequences defined by the C lib), not the ones I need for Unicode.
Hm, can't we just use \n for now?
BTW, I think that we may need a new Codec class layer here: .readline() et al. are all text based methods, while the Codec base classes clearly work on all kinds of binary and text data.
Not sure what you mean here. Can you explain through an example?
Well, the line concept is really only applicable to text data. Binary data doesn't have lines, and e.g. a ZIP codec (probably) couldn't implement this kind of method.

As it turns out, only the .writelines() method needs to know what kinds of input/output data objects are used (and then only to be able to specify a joining separator). I'll just leave things as they are for now: quite shallow with respect to the class hierarchy.
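(The point about binary codecs can be seen with modern bytes-to-bytes codecs such as zlib_codec: they transform arbitrary byte streams, so a line-oriented method has nothing to hook into:)

```python
import codecs

payload = b'hello hello hello'

# A bytes-to-bytes codec: compression, with no notion of "lines" at all.
packed = codecs.encode(payload, 'zlib_codec')
assert codecs.decode(packed, 'zlib_codec') == payload
assert packed != payload  # the compressed form is opaque binary data
```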

Guido van Rossum wrote:
Finally, I believe we need a way to discover the encoding used by stdin or stdout. I have to admit I know very little about the file wrappers that Marc wrote -- is it easy to get the encoding out of them?
I'm not sure what you mean: the name of the input encoding? Currently, only the names of the encoding and decoding functions are available to be queried.
Whatever is helpful for a module or program that wants to know what kind of encoding is used.
Hmm, you mean something like file.encoding? I'll add some additional attributes holding the encoding names to the wrapper classes (they will then be set by the wrapper constructor functions).

BTW, I've just added .readline() et al. to the codecs... all except .readline() are easy to do. For .readline() I simply delegated line breaking to the underlying stream's .readline() method. This is far from optimal, but better than not having the method at all.

I also adjusted the interfaces of the .splitlines() methods; they now take a different optional argument:

    """ S.splitlines([keepends]) -> list of strings

        Return a list of the lines in S, breaking at line boundaries.
        Line breaks are not included in the resulting list unless keepends
        is given and true.
    """

This made implementing the above methods very simple and also allows writing codecs working with other basic storage types (UserString.py anyone ;-).
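(Both ideas are recognizable in today's Python: text streams expose an encoding attribute, and splitlines() kept exactly this keepends signature:)

```python
import io

# Text streams expose the encoding they were constructed with:
stream = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
assert stream.encoding == 'utf-8'

# splitlines() takes the optional keepends flag described above:
s = 'one\ntwo\r\nthree'
assert s.splitlines() == ['one', 'two', 'three']
assert s.splitlines(True) == ['one\n', 'two\r\n', 'three']
```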
participants (2)
- Guido van Rossum
- M.-A. Lemburg