[Python-ideas] TextIOWrapper callable encoding parameter
ncoghlan at gmail.com
Mon Jun 11 17:10:47 CEST 2012
Immediate thought: it seems like it would be easier to offer a way to
inject data back into a buffered IO object's internal buffer.
Sent from my phone, thus the relative brevity :)
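To make the "inject data back" idea concrete, here is a minimal sketch. It assumes a hypothetical PrefixedRaw class and a hypothetical sniff_then_wrap() helper; neither exists in the io module, and the real buffered objects offer no such hook today. The raw stream replays the bytes consumed while sniffing before it delegates to the underlying source, so nothing needs to be rewound.

```python
import io


class PrefixedRaw(io.RawIOBase):
    """Hypothetical raw stream: replays `prefix`, then reads from `raw`."""

    def __init__(self, prefix, raw):
        super().__init__()
        self._prefix = bytes(prefix)
        self._raw = raw

    def readable(self):
        return True

    def readinto(self, b):
        if self._prefix:
            # Hand back the bytes that were consumed while sniffing.
            n = min(len(b), len(self._prefix))
            b[:n] = self._prefix[:n]
            self._prefix = self._prefix[n:]
            return n
        data = self._raw.read(len(b))
        if not data:
            return 0  # end of stream
        b[:len(data)] = data
        return len(data)


def sniff_then_wrap(raw, sniff, size=1024):
    """Read up to `size` bytes, ask `sniff` for an encoding name,
    and return a text stream that still starts at the first byte."""
    head = raw.read(size) or b""
    encoding = sniff(head)
    buffered = io.BufferedReader(PrefixedRaw(head, raw))
    return io.TextIOWrapper(buffered, encoding=encoding)
```

Here sniff is any function taking the leading bytes and returning an encoding name. The source stream is read only once and never seeked.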
On Jun 12, 2012 12:43 AM, "Rurpy" <rurpy at yahoo.com> wrote:
> Here is another issue that came up in my ongoing
> adventure porting to Python3...
> Executive summary:
> There is no good way to read a text file when the
> encoding has to be determined by reading the start
> of the file. A long-winded version of that follows.
> Scroll down to the "Proposal" section to skip it.
> When one opens a text file for reading, one must specify
> (explicitly or by default) an encoding which Python will
> use to convert the raw bytes read into Python strings.
> This means one must know the encoding of a file before
> opening it, which is usually the case, but not always.
> Plain text files have no meta-data giving their encoding
> so sometimes it may not be known and some of the file must
> be read and a guess made. Other data like html pages, xml
> files or python source code have encoding information inside
> them, but that too requires reading the start of the file
> without knowing the encoding in advance.
> I see three ways in general in Python 3 currently to attack
> this problem, but each has some severe drawbacks:
> 1. The most straight-forward way to handle this is to open
> the file twice, first in binary mode or with latin1 encoding
> and again in text mode after the encoding has been determined.
> This of course has a performance cost since the data is read
> twice. Further, it can't be used if the data source is a
> pipe, socket or other non-rewindable source. This
> includes sys.stdin when it comes from a pipe.
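A minimal sketch of this two-pass approach, assuming the file holds Python source so that tokenize.detect_encoding() can do the sniffing. The helper name open_twice is made up for illustration.

```python
import tokenize


def open_twice(path):
    """Open `path` in binary mode to sniff the encoding, then reopen as text."""
    with open(path, "rb") as raw:
        # detect_encoding() reads at most two lines, looking for a BOM
        # or a PEP 263 coding declaration; it falls back to UTF-8.
        encoding, _lines = tokenize.detect_encoding(raw.readline)
    # Second open: the start of the file is read again from scratch.
    return open(path, "r", encoding=encoding)
```

Because the path is opened a second time, this only works for sources that can be reopened, which is the limitation described above.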
> 2. Alternatively, with a little more expertise, one can rewrap
> the open binary stream in a TextIOWrapper to avoid a second
> OS file open. The standard library's tokenize.open()
> function does this:
> def open(filename):
>     buffer = builtins.open(filename, 'rb')
>     encoding, lines = detect_encoding(buffer.readline)
>     buffer.seek(0)
>     text = TextIOWrapper(buffer, encoding, line_buffering=True)
>     text.mode = 'r'
>     return text
> This too seems to read the data twice and of course the
> seek(0) prevents this method also from being usable with
> pipes, sockets and other non-seekable sources.
> 3. Another method is to simply leave the file open in
> binary mode, read bytes data, and manually decode it to
> text. This seems to be the only option when reading from
> non-rewindable sources like pipes and sockets, etc.
> But then one loses all the advantages of having
> a text stream even though one wants to be reading text!
> And if one tries to hide this, one ends up reimplementing
> a good part of TextIOWrapper!
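A rough sketch of this third approach, assuming a caller-supplied sniff function that maps the first bytes to an encoding name. The names iter_decoded and sniff are illustrative only.

```python
import codecs


def iter_decoded(raw, sniff, chunk_size=4096):
    """Yield text chunks from a binary stream, sniffing the encoding once."""
    head = raw.read(chunk_size)
    # An incremental decoder carries partial multi-byte sequences over
    # from one chunk to the next; a plain bytes.decode() per chunk would not.
    decoder = codecs.getincrementaldecoder(sniff(head))()
    data = head
    while data:
        yield decoder.decode(data)
        data = raw.read(chunk_size)
    # Flush the decoder; this raises if the stream ended mid-character.
    yield decoder.decode(b"", final=True)
```

This works on pipes and sockets because nothing is reread, but the caller gets bare text chunks: no readline(), no newline translation, no tell() or seek(). Those are the pieces one ends up rebuilding by hand.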
> I believe these problems could be addressed with a fairly
> simple and clean modification of the io.TextIOWrapper class.
> The following is a logical description; I don't mean to
> imply that the code must follow this outline exactly.
> It is based on looking at _pyio; I hope the C code is similar.
> 1. Allow io.TextIOWrapper's encoding parameter to be a
> callable object in addition to a string or None.
> 2. In __init__(), if the encoding parameter was callable,
> record it as an encoding hook and leave encoding set to None.
> 3. The places in io.TextIOWrapper that currently read
> undecoded data from the internal buffer object and decode
> it (only methods read() and _read_chunk(), I think) would
> be modified to do so in this way:
> 4. Read data from the buffer object as is done now.
> 5. If the encoding has been set, get a decoder if necessary
> and continue on as usual.
> 6. If the encoding is None, call the encoding callable
> with the data just read and the buffer object.
> 7. The callable will examine the data, possibly using the
> buffer object's peek method to look further ahead in the
> file. It returns the name of an encoding.
> 8. io.TextIOWrapper will get the encoding and record it,
> and set up the decoder the same way as if the encoding name
> had been received as a parameter, decode the read data and
> continue on as usual.
> 9. In other non-read paths where encoding needs to be known,
> raise an error if it is still None.
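The steps above can be sketched outside of TextIOWrapper. This is a minimal illustration under stated assumptions, not the modified _pyio code: LazyTextReader is a hypothetical stand-in class, it only implements a read-to-EOF method, it requires the non-callable encoding to be a codec name, and it leaves out the error path of step 9.

```python
import codecs


class LazyTextReader:
    """Hypothetical stand-in for a TextIOWrapper whose encoding may be callable."""

    def __init__(self, buffer, encoding, chunk_size=8192):
        self.buffer = buffer
        self._chunk_size = chunk_size
        self._decoder = None
        if callable(encoding):
            # Step 2: record the hook and leave the encoding unset.
            self._hook, self.encoding = encoding, None
        else:
            # Assumption: a non-callable encoding is always a codec name here.
            self._hook, self.encoding = None, encoding

    def read(self):
        """Read and decode everything up to end of file."""
        parts = []
        while True:
            data = self.buffer.read1(self._chunk_size)  # step 4
            if self.encoding is None:
                # Steps 6 and 7: the hook sees the bytes just read and may
                # call buffer.peek() to look further ahead.
                self.encoding = self._hook(data, self.buffer)
            if self._decoder is None:
                # Steps 5 and 8: build the decoder once the name is known.
                self._decoder = codecs.getincrementaldecoder(self.encoding)()
            parts.append(self._decoder.decode(data, final=not data))
            if not data:
                return "".join(parts)
```

The hook runs once, on the first chunk, and the same bytes it examined are then decoded, so the buffer object never has to seek.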
> Were io.TextIOWrapper modified this way, it would offer:
> * Better performance since there is no need to reread data.
> * Read data is decoded after being examined so the stream
> is usable with serial data sources like pipes, sockets, etc.
> * User code is simplified and clearer; there is better
> separation of concerns. For example, the code in the
> "Problem" section could be written:
> stream = open(filename, encoding=detect_encoding)
>
> def detect_encoding(data, buffer):
>     # This is still basically the same function as
>     # in the code in the "Problem" section.
>     # ... look for Python coding declaration in
>     # first two lines of the 'data' bytes object ...
>     if not found_encoding:
>         raise Error("unable to determine encoding")
>     return found_encoding
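For illustration, a callable with the proposed (data, buffer) signature might look like the sketch below. It is an assumption-laden stand-in for the elided body above: COOKIE_RE is a simplified PEP 263 pattern written for this example, the 1024-byte peek size is arbitrary, and LookupError is substituted for the unspecified Error.

```python
import codecs
import re

# Simplified PEP 263 coding declaration, matched against raw bytes.
COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_encoding(data, buffer):
    """Return the encoding declared in the first two lines of the stream."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.count(b"\n") < 2:
        # Look further ahead without consuming anything from the stream.
        data += buffer.peek(1024)
    for line in data.split(b"\n")[:2]:
        match = COOKIE_RE.match(line)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # raises LookupError for unknown codecs
            return name
    raise LookupError("unable to determine encoding")
```

Since open() does not accept a callable encoding today, the function can only be exercised directly, for example by passing it a bytes object and an io.BufferedReader.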
> I have modified a copy of the _pyio module as described and
> the changes required seemed unsurprising and relatively
> few, though I am sure there are subtleties and other
> considerations I am missing. Hence this post seeking feedback.