TextIOWrapper callable encoding parameter

Here is another issue that came up in my ongoing adventure porting to Python 3...

Executive summary:
==================
There is no good way to read a text file when the encoding has to be determined by reading the start of the file. A long-winded version of that follows; scroll down to the "Proposal" section to skip it.

Problem:
========
When one opens a text file for reading, one must specify (explicitly or by default) an encoding which Python will use to convert the raw bytes read into Python strings. This means one must know the encoding of a file before opening it, which is usually the case, but not always. Plain text files have no metadata giving their encoding, so sometimes it may not be known and some of the file must be read and a guess made. Other data like HTML pages, XML files or Python source code have encoding information inside them, but that too requires reading the start of the file without knowing the encoding in advance.

I see three general ways to attack this problem in Python 3 currently, but each has some severe drawbacks:

1. The most straightforward way is to open the file twice: first in binary mode or with latin1 encoding, and again in text mode after the encoding has been determined. This of course has a performance cost since the data is read twice. Further, it can't be used if the data source is a pipe, socket or other non-rewindable source. This includes sys.stdin when it comes from a pipe.

2. Alternatively, with a little more expertise, one can rewrap the open binary stream in a TextIOWrapper to avoid a second OS file open. The standard library's tokenize.open() function does this:

    def open(filename):
        buffer = builtins.open(filename, 'rb')
        encoding, lines = detect_encoding(buffer.readline)
        buffer.seek(0)
        text = TextIOWrapper(buffer, encoding, line_buffering=True)
        text.mode = 'r'
        return text

This too seems to read the data twice, and of course the seek(0) also prevents this method from being usable with pipes, sockets and other non-seekable sources.

3. Another method is to simply leave the file open in binary mode, read bytes data, and manually decode it to text. This seems to be the only option when reading from non-rewindable sources like pipes and sockets. But then one loses all the advantages of having a text stream even though one wants to be reading text! And if one tries to hide this, one ends up reimplementing a good part of TextIOWrapper!

I believe these problems could be addressed with a fairly simple and clean modification of the io.TextIOWrapper class...

Proposal
========
The following is a logical description; I don't mean to imply that the code must follow this outline exactly. It is based on looking at _pyio; I hope the C code is equivalent.

1. Allow io.TextIOWrapper's encoding parameter to be a callable object in addition to a string or None.

2. In __init__(), if the encoding parameter was callable, record it as an encoding hook and leave encoding set to None.

3. The places in io.TextIOWrapper that currently read undecoded data from the internal buffer object and decode it (only the read() and read_chunk() methods, I think) would be modified to work this way:

4. Read data from the buffer object as is done now.

5. If the encoding has been set, get a decoder if necessary and continue on as usual.

6. If the encoding is None, call the encoding callable with the data just read and the buffer object.

7. The callable will examine the data, possibly using the buffer object's peek() method to look further ahead in the file. It returns the name of an encoding.

8. io.TextIOWrapper will record the returned encoding and set up the decoder the same way as if the encoding name had been received as a parameter, decode the data just read, and continue on as usual.

9. In other non-read paths where the encoding needs to be known, raise an error if it is still None.

Were io.TextIOWrapper modified this way, it would offer:

* Better performance, since there is no need to reread data.
* Read data is decoded after being examined, so the stream is usable with serial data sources like pipes, sockets, etc.
* User code is simplified and clearer; there is better separation of concerns.

For example, the code in the "Problem" section could be written:

    stream = open(filename, encoding=detect_encoding)
    ...

    def detect_encoding(data, buffer):
        # This is still basically the same function as
        # in the code in the "Problem" section.
        ... look for Python coding declaration in the first
        ... two lines of the 'data' bytes object.
        if not found_encoding:
            raise Error("unable to determine encoding")
        return found_encoding

I have modified a copy of the _pyio module as described and the changes required seemed unsurprising and relatively few, though I am sure there are subtleties and other considerations I am missing. Hence this post seeking feedback...
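[Editor's note: for concreteness, here is a minimal runnable sketch of what such a callable might look like, assuming the (data, buffer) signature proposed above. The regular expression and the ValueError are illustrative choices only, and passing the callable as open()'s encoding argument is of course the *proposed* behaviour, not something current Python supports; until such a hook exists, the function can only be driven by hand and its result passed to TextIOWrapper.]

    import codecs
    import re

    _CODING_RE = re.compile(br'coding[:=]\s*([-\w.]+)')

    def detect_encoding(data, buffer):
        # 'data' is the first chunk of undecoded bytes read by TextIOWrapper;
        # 'buffer' is the underlying binary stream, whose peek() could be used
        # to look further ahead without consuming anything.
        if data.startswith(codecs.BOM_UTF8):
            return 'utf-8-sig'
        # PEP 263-style coding declaration in the first two lines
        for line in data.splitlines()[:2]:
            m = _CODING_RE.search(line)
            if m:
                return m.group(1).decode('ascii')
        raise ValueError('unable to determine encoding')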

Nick Coghlan writes:
Immediate thought: it seems like it would be easier to offer a way to inject data back into a buffered IO object's internal buffer.
ungetch()? If you're only interested in the top of the file (see below), I would suggest allowing only one bufferful, and then simply rewinding the buffer pointer once you're done. This is one strategy used by Emacsen for encoding detection (for the reason pointed out by Rurpy: not all streams are rewindable). But is that really "easier"? It might be more general, but you still need to reinitialize the encoding (i.e., from the trivial "binary" to whatever is detected), with all the hair that comes with that.
This may be insufficiently general. Specifically, both Emacsen and vi allow specification of editor configuration variables at the bottom of the file as well as the top. I don't know whether vi allows encoding specs at the bottom, but Emacsen do (but only for files). I wouldn't recommend paying much attention to what Emacsen actually *do* when initializing a stream (it's, uh, "baroque").

2012/6/11 Nick Coghlan <ncoghlan@gmail.com>:
Immediate thought: it seems like it would be easier to offer a way to inject data back into a buffered IO object's internal buffer.
BufferedReader already has a useful peek() method to read data without changing the position. http://docs.python.org/library/io.html#io.BufferedReader.peek It's not perfect ("The number of bytes returned may be less or more than requested.") but better than nothing. Victor
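[Editor's note: a rough illustration of what peek() already makes possible today; the file name and the simple BOM test are placeholders, not a general detector. Because nothing is seeked, the same pattern works on pipes, subject to the buffer-size limit Antoine notes below.]

    import io

    binary = open('data.txt', 'rb')            # any BufferedReader-backed stream
    head = binary.peek(16)                     # does not advance the stream position
    encoding = 'utf-8-sig' if head.startswith(b'\xef\xbb\xbf') else 'utf-8'
    text = io.TextIOWrapper(binary, encoding=encoding)
    first_line = text.readline()
    text.close()                               # also closes the underlying buffer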

On Tue, 12 Jun 2012 01:10:47 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Immediate thought: it seems like it would be easier to offer a way to inject data back into a buffered IO object's internal buffer.
Except that it would be limited by buffer size, which is not necessarily something you have control over. Regards Antoine.

On Mon, Jun 11, 2012 at 8:42 AM, Rurpy <rurpy@yahoo.com> wrote:
FWIW, the import system does an encoding check on Python source files that is somewhat related. See http://www.python.org/dev/peps/pep-0263/. -eric
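[Editor's note: for reference, that PEP 263 check is exposed in the standard library as tokenize.detect_encoding(), which takes a readline callable and reads at most the first two lines; the file name below is just a placeholder.]

    import tokenize

    with open('some_module.py', 'rb') as f:
        encoding, consumed_lines = tokenize.detect_encoding(f.readline)
    print(encoding)    # 'utf-8' when no BOM or coding declaration is found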

Some months ago, I proposed to automatically detect whether a file contains a BOM and use it to set the encoding. Various methods were proposed but there was no real consensus. One proposition was to use a codec (e.g. "bom") which uses the BOM if it is present, so the file does not need to be read twice.

For the pipe issue: it depends where the encoding specification is. If the encoding is written at the end of your "file" (stream), you have to store the whole stream content (a few MB, or maybe much more?) in memory. If it is in the first lines, you have to store those lines in a buffer. It's not easy to decide on the threshold.

I don't like the codec approach because the codec is disconnected from the stream. For example, the codec doesn't know the current position in the stream, nor can it read a few more bytes forward or backward. If you open the file in "append" mode, you are not writing at the beginning but at the end of the file. You may also seek to an arbitrary position before the first read... There are also some special cases. For example, when a text file is opened in write mode, is seekable, and the file position is not zero, TextIOWrapper calls encoder.setstate(0) so that it does not write the BOM in the middle of the file. (See also Lib/test/test_io.py for related tests.)
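[Editor's note: a hedged sketch of the kind of BOM check being described, using the codecs constants; the function name and the default fallback are illustrative only and are not the proposed "bom" codec itself.]

    import codecs

    _BOMS = [
        (codecs.BOM_UTF32_LE, 'utf-32-le'),   # test 32-bit BOMs before 16-bit ones,
        (codecs.BOM_UTF32_BE, 'utf-32-be'),   # since BOM_UTF32_LE starts with BOM_UTF16_LE
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
        (codecs.BOM_UTF8, 'utf-8-sig'),
    ]

    def encoding_from_bom(prefix, default='utf-8'):
        # 'prefix' is the first few bytes of the stream (e.g. from peek()).
        for bom, name in _BOMS:
            if prefix.startswith(bom):
                return name
        return default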
That's my favorite method because you have full control over the stream. (I wrote tokenize.open().) But yes, it does not work on non-seekable streams (e.g. pipes).
Does it really matter? You usually only need to read a few bytes to get the encoding.
9. In other non-read paths where encoding needs to be known, raise an error if it is still None.
Why not read data until the encoding is known instead?
Can you post the modified module somewhere so I can play with it? Victor

participants (6):
- Antoine Pitrou
- Eric Snow
- Nick Coghlan
- Rurpy
- Stephen J. Turnbull
- Victor Stinner