[Python-ideas] TextIOWrapper callable encoding parameter
Rurpy
rurpy at yahoo.com
Mon Jun 11 16:42:46 CEST 2012
Here is another issue that came up in my ongoing
adventure porting to Python3...
Executive summary:
==================
There is no good way to read a text file when the
encoding has to be determined by reading the start
of the file. A long-winded version of that follows.
Scroll down to the "Proposal" section to skip it.
Problem:
========
When one opens a text file for reading, one must specify
(explicitly or by default) an encoding which Python will
use to convert the raw bytes read into Python strings.
This means one must know the encoding of a file before
opening it, which is usually the case, but not always.
Plain text files have no meta-data giving their encoding
so sometimes it may not be known and some of the file must
be read and a guess made. Other data like HTML pages, XML
files or Python source code have encoding information inside
them, but that too requires reading the start of the file
without knowing the encoding in advance.
I see three ways in general in Python 3 currently to attack
this problem, but each has some severe drawbacks:
1. The most straightforward way to handle this is to open
the file twice, first in binary mode or with latin1 encoding
and again in text mode after the encoding has been determined.
This of course has a performance cost since the data is read
twice. Further, it can't be used if the data comes from
a pipe, socket or other non-rewindable source. This
includes sys.stdin when it comes from a pipe.
2. Alternatively, with a little more expertise, one can rewrap
the open binary stream in a TextIOWrapper to avoid a second
OS file open. The standard library's tokenize.open()
function does this:
    def open(filename):
        buffer = builtins.open(filename, 'rb')
        encoding, lines = detect_encoding(buffer.readline)
        buffer.seek(0)
        text = TextIOWrapper(buffer, encoding, line_buffering=True)
        text.mode = 'r'
        return text
This too seems to read the data twice and of course the
seek(0) prevents this method also from being usable with
pipes, sockets and other non-seekable sources.
3. Another method is to simply leave the file open in
binary mode, read bytes data, and manually decode it to
text. This seems to be the only option when reading from
non-rewindable sources like pipes and sockets, etc.
But then one loses all the advantages of having
a text stream even though one wants to be reading text!
And if one tries to hide this, one ends up reimplementing
a good part of TextIOWrapper!
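To make the trade-off concrete, here is a minimal sketch, assuming a POSIX
system where os.pipe() gives a non-seekable stream. It shows that the
seek(0) used in approach 2 fails on a pipe, and what the manual decoding of
approach 3 looks like. The helper names rewind_error, iter_decoded and demo
are hypothetical, and the encoding is taken as already known to keep the
example short.

```python
import codecs
import io
import os


def rewind_error(buffer):
    """Return the exception raised by buffer.seek(0), or None if it works."""
    try:
        buffer.seek(0)
    except OSError as exc:  # io.UnsupportedOperation subclasses OSError
        return exc
    return None


def iter_decoded(buffer, encoding, chunk_size=4096):
    """Yield text decoded by hand from a binary stream (approach 3)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = buffer.read(chunk_size)
        # An empty chunk means EOF; final=True flushes any pending bytes.
        text = decoder.decode(chunk, final=not chunk)
        if text:
            yield text
        if not chunk:
            break


def demo():
    read_fd, write_fd = os.pipe()
    os.write(write_fd, "# coding: utf-8\nname = 'café'\n".encode("utf-8"))
    os.close(write_fd)
    with open(read_fd, "rb") as buffer:
        error = rewind_error(buffer)  # approach 2 would stop here
        # Small chunks force multi-byte characters to straddle reads.
        text = "".join(iter_decoded(buffer, "utf-8", chunk_size=5))
    return error, text
```

Note that iter_decoded() yields arbitrary chunks of text: there is no
readline(), no newline translation and no line iteration, which is exactly
the functionality one would have to rebuild on top of it.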
I believe these problems could be addressed with a fairly
simple and clean modification of the io.TextIOWrapper
class...
Proposal
========
The following is a logical description; I don't mean to
imply that the code must follow this outline exactly.
It is based on looking at _pyio; I hope the C code is
equivalent.
1. Allow io.TextIOWrapper's encoding parameter to be a
callable object in addition to a string or None.
2. In __init__(), if the encoding parameter was callable,
record it as an encoding hook and leave encoding set to
None.
3. The places in io.TextIOWrapper that currently read
undecoded data from the internal buffer object and decode
it (only the methods read() and _read_chunk(), I think) would
be modified to do so in this way:
4. Read data from the buffer object as is done now.
5. If the encoding has been set, get a decoder if necessary
and continue on as usual.
6. If the encoding is None, call the encoding callable
with the data just read and the buffer object.
7. The callable will examine the data, possibly using the
buffer object's peek method to look further ahead in the
file. It returns the name of an encoding.
8. io.TextIOWrapper will get the encoding and record it,
and set up the decoder the same way as if the encoding name
had been received as a parameter, decode the read data and
continue on as usual.
9. In other non-read paths where encoding needs to be known,
raise an error if it is still None.
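The read path described above can be approximated from outside
io.TextIOWrapper with today's API. The following is only a sketch under
stated assumptions, not the proposed change itself: wrap_with_detector and
bom_detector are hypothetical names, and the buffer object's peek() method
stands in for "the data just read" so that nothing is consumed before the
decoder is set up.

```python
import codecs
import io


def bom_detector(data, buffer):
    """Hypothetical encoding callable: guess from a byte order mark."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Simplification: a UTF-32 LE BOM also starts with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"  # assumed fallback when no BOM is present


def wrap_with_detector(buffer, detect, peek_size=1024, **kwargs):
    """Wrap a buffered binary stream, asking `detect` for the encoding."""
    # peek() returns buffered bytes without advancing the position, so
    # this also works on pipes and sockets; it may return fewer or more
    # bytes than peek_size.
    data = buffer.peek(peek_size)
    encoding = detect(data, buffer)
    if encoding is None:
        raise LookupError("unable to determine encoding")
    return io.TextIOWrapper(buffer, encoding=encoding, **kwargs)
```

For example, wrap_with_detector(open(path, "rb"), bom_detector) returns an
ordinary text stream. Unlike the proposal, detection here happens once at
wrap time rather than lazily on the first read.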
Were io.TextIOWrapper modified this way, it would offer:
* Better performance since there is no need to reread data
* Read data is decoded after being examined so the stream
is usable with serial data sources like pipes, sockets, etc.
* User code is simplified and clearer; there is better
separation of concerns. For example, the code in the
"Problem" section could be written:
    stream = open(filename, encoding=detect_encoding)
    ...

    def detect_encoding(data, buffer):
        # This is still basically the same function as
        # in the code in the "Problem" section.
        # ... look for Python coding declaration in
        # first two lines of the 'data' bytes object.
        if not found_encoding:
            raise LookupError("unable to determine encoding")
        return found_encoding
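A filled-in version of such a callable might look like the sketch below. The
(data, buffer) signature is the one proposed in this post and is not an
existing API; the regular expression follows the coding-declaration pattern
of PEP 263, and raising LookupError is an assumption made here for
illustration.

```python
import codecs
import re

# Coding declaration as described in PEP 263, applied to bytes.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_encoding(data, buffer=None):
    """Return the encoding declared in the first two lines of `data`.

    `buffer` is accepted to match the proposed signature; this sketch
    does not need to look further ahead, so it is unused.
    """
    for line in data.split(b"\n")[:2]:
        match = CODING_RE.match(line)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # raises LookupError for unknown codecs
            return name
    raise LookupError("unable to determine encoding")
```

Since open() does not accept a callable today, the function can only be
exercised directly, for instance by passing it the first bytes read from a
binary stream.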
I have modified a copy of the _pyio module as described and
the changes required seemed unsurprising and relatively
few, though I am sure there are subtleties and other
considerations I am missing. Hence this post seeking
feedback...