Re: [Python-ideas] TextIOWrapper callable encoding parameter

On 06/12/2012 03:48 PM, Victor Stinner wrote:
That's always a problem. When trying to determine a character encoding one may have to read the entire file, because it could consist of all ASCII characters except the very last one. (And of course there is no guarantee one can determine *the* encoding at all.) Nevertheless, I think there is a very large class of problems that can be usefully handled by looking at a limited amount of data at the start of a file (or stream). The Python coding declaration is one example (obviously picked hoping it would have some resonance here). The buffer object used by TextIOWrapper already reads the start of the stream and buffers the first few lines, so why not take advantage of that rather than repeating the work? One of the things I am not sure about is whether there are cases when the buffered read returns, say, only one line, as might happen with tty input.
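To make the idea concrete, here is a rough sketch of the kind of callable I have in mind, approximated outside TextIOWrapper with the existing API. The names sniff_python_encoding and open_text_sniffed are made up for illustration and are not part of io or of my diff; tokenize.detect_encoding and BufferedReader.peek are the real stdlib calls. Note that peek() may return fewer bytes than the buffer size, which is exactly the tty concern above.

```python
import io
import tokenize


def sniff_python_encoding(head):
    """Hypothetical callable: pick an encoding from the first bytes of a stream.

    Uses the Python coding declaration (or a UTF-8 BOM) if present,
    otherwise falls back to tokenize's default of 'utf-8'.
    """
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(head).readline)
    return encoding


def open_text_sniffed(binary, sniff=sniff_python_encoding):
    """Hypothetical helper: wrap a binary stream, choosing the encoding late.

    peek() returns bytes that are already buffered without advancing the
    stream position, so nothing has to be re-read or seeked over.
    """
    buffered = io.BufferedReader(binary)
    head = buffered.peek()  # may be short, e.g. a single line from a tty
    return io.TextIOWrapper(buffered, encoding=sniff(head))
```

Usage would be something like open_text_sniffed(open(path, "rb", buffering=0)). This only illustrates the "look first, then choose" order; it says nothing about how the change would sit inside TextIOWrapper itself.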
A callable encoding parameter would not be terribly useful with a file opened in write or append mode, but its behavior would be predictable: a write would result in an error because the encoding hadn't been set. A read in the middle of the file would work the same way as at the beginning. This is probably not very useful, but it is consistent. Of course one could choose to implement a callable encoding parameter such that some or all of these paths are detected at open and declared illegal then. One could prohibit the encoding call after a seek, though I'm not sure there is any point to that.
It certainly matters if input is from a pipe. Quoting from my other message:

  $ cat test.utf8 | python3 stdin.py reopen1
  got exception: [Errno 29] Illegal seek

The whole point of my suggestion was that you've already read those few bytes -- but by the time you have access to them, you've already been forced to choose an encoding. My suggestion simply defers that encoding setting until after you've had a chance to look at the bytes.
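For anyone who wants to see the underlying failure without my stdin.py script, the following small sketch reproduces it at the OS level. It assumes a POSIX system, and the function name seek_errno_on_pipe is made up for the example; it is not what stdin.py does, only the same error from the same cause.

```python
import errno
import os


def seek_errno_on_pipe():
    """Try to seek on the read end of a pipe and return the resulting errno."""
    read_fd, write_fd = os.pipe()
    try:
        os.lseek(read_fd, 0, os.SEEK_SET)  # pipes have no file position
    except OSError as exc:
        return exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return None  # only reached if the seek unexpectedly succeeded


if __name__ == "__main__":
    code = seek_errno_on_pipe()
    print(code == errno.ESPIPE, os.strerror(code))
```

ESPIPE is the "Illegal seek" in the message quoted above, which is why any approach that needs to rewind the input cannot work on a pipe.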
That's how I do it now -- open the file in binary mode, read it, buffer it, determine the encoding, and henceforth decode the bytes data "by hand" to text. But that's an awful lot like what TextIOWrapper does, yes? Why can't I use TextIOWrapper instead of rewriting it myself? (Yes, I know I can reopen or rewrap the binary stream, but as I said, that loses the one-pass processing, which breaks pipes.)
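The "by hand" approach looks roughly like the sketch below. It is a simplified illustration, not my actual code: decode_stream and bom_or_utf8 are names invented here, and the BOM check stands in for whatever detection one really needs. The point is how much of it duplicates TextIOWrapper (and it still lacks newline translation, readline, and so on).

```python
import codecs
import io


def bom_or_utf8(head):
    """Hypothetical chooser: UTF-16 if the data starts with a BOM, else UTF-8."""
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


def decode_stream(binary, choose_encoding, chunk_size=4096):
    """Yield decoded text chunks, choosing the encoding from the first chunk.

    The stream is read exactly once, front to back, so this also works
    on input that cannot seek.
    """
    chunk = binary.read(chunk_size)
    decoder = codecs.getincrementaldecoder(choose_encoding(chunk))()
    while chunk:
        # The incremental decoder keeps partial characters between chunks.
        yield decoder.decode(chunk)
        chunk = binary.read(chunk_size)
    yield decoder.decode(b"", final=True)  # flush; raises on truncated input
```

For example, "".join(decode_stream(sys.stdin.buffer, bom_or_utf8)) decodes standard input in one pass.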
I put a diff against the Python-3.2.3 _pyio.py file at:

  http://pastebin.com/kZHmcBdm

Much of the diff is just moving existing stuff around. The note at the bottom says:

| It is in no way supposed to be a serious patch.
|
| It was the minimal changes I could make in order to
| see if my suggestion to allow a callable encoding parameter
| in TextIOWrapper was feasible, and allow some timing tests.
|
| I am quite sure it will not pass Python's tests.
|
| It does, I hope, give some idea of the nature and scale of the
| code changes needed to implement a callable encoding parameter.
participants (1)
-
Rurpy