[Python-ideas] TextIOWrapper callable encoding parameter

Mon Jun 11 17:10:47 CEST 2012

Immediate thought: it seems like it would be easier to offer a way to
inject data back into a buffered IO object's internal buffer.
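
A rough sketch of what "injecting data back" could look like today. As far as I know there is no public API for pushing bytes into a BufferedReader's buffer, so this emulates the idea with a small raw-stream wrapper. ReplayRaw is a hypothetical name, not anything in the io module:

```python
import io


class ReplayRaw(io.RawIOBase):
    """Hypothetical raw stream: serve `head` first, then read on from `raw`."""

    def __init__(self, head, raw):
        super().__init__()
        self._head = bytes(head)
        self._raw = raw

    def readable(self):
        return True

    def readinto(self, b):
        if self._head:
            # Hand back the bytes that were already consumed while sniffing.
            n = min(len(b), len(self._head))
            b[:n] = self._head[:n]
            self._head = self._head[n:]
            return n
        return self._raw.readinto(b)


# Consume one line to sniff the encoding, then replay it without seeking.
source = io.BytesIO("# coding: latin-1\nname = 'caf\xe9'\n".encode("latin-1"))
sniffed = source.readline()
replayed = io.BufferedReader(ReplayRaw(sniffed, source))
text = io.TextIOWrapper(replayed, encoding="latin-1")
lines = text.readlines()
```

Nothing here calls seek(), so the same wrapper should also work when the source is a pipe or socket.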

--
Sent from my phone, thus the relative brevity :)
On Jun 12, 2012 12:43 AM, "Rurpy" <rurpy at yahoo.com> wrote:

> Here is another issue that came up in my ongoing
> adventure porting to Python3...
>
> Executive summary:
> ==================
>
> There is no good way to read a text file when the
> encoding has to be determined by reading the start
> of the file.  A long-winded version of that follows.
> Scroll down to the "Proposal" section to skip it.
>
> Problem:
> ========
>
> When one opens a text file for reading, one must specify
> (explicitly or by default) an encoding which Python will
> use to convert the raw bytes read into Python strings.
> This means one must know the encoding of a file before
> opening it, which is usually the case, but not always.
>
> Plain text files have no meta-data giving their encoding
> so sometimes it may not be known and some of the file must
> be read and a guess made.  Other data like html pages, xml
> files or python source code have encoding information inside
> them, but that too requires reading the start of the file
> without knowing the encoding in advance.
>
> I see three ways in general in Python 3 currently to attack
> this problem, but each has some severe drawbacks:
>
> 1.  The most straightforward way to handle this is to open
> the file twice, first in binary mode or with latin1 encoding
> and again in text mode after the encoding has been determined.
> This of course has a performance cost since the data is read
> twice.  Further, it can't be used if the data source is a
> pipe, socket or other non-rewindable source.  This
> includes sys.stdin when it comes from a pipe.
>
> 2.  Alternatively, with a little more expertise, one can rewrap
> the open binary stream in a TextIOWrapper to avoid a second
> OS file open.  The standard library's tokenize.open()
> function does this:
>
>    def open(filename):
>        buffer = builtins.open(filename, 'rb')
>        encoding, lines = detect_encoding(buffer.readline)
>        buffer.seek(0)
>        text = TextIOWrapper(buffer, encoding, line_buffering=True)
>        text.mode = 'r'
>        return text
>
> This too seems to read the data twice and of course the
> seek(0) prevents this method also from being usable with
> pipes, sockets and other non-seekable sources.
>
> 3.  Another method is to simply leave the file open in
> binary mode, read bytes data, and manually decode it to
> text.  This seems to be the only option when reading from
> non-rewindable sources like pipes and sockets, etc.
> But then one loses all the advantages of having
> a text stream even though one wants to be reading text!
> And if one tries to hide this, one ends up reimplementing
> a good part of TextIOWrapper!
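
For comparison, a minimal sketch of approaches 1 and 3 from the list above. The helper names open_twice and decode_chunks are made up for illustration; tokenize.detect_encoding and codecs.getincrementaldecoder are the real standard library calls:

```python
import codecs
import io
import os
import tempfile
import tokenize


def open_twice(path):
    """Approach 1: sniff in binary mode, then reopen in text mode."""
    with open(path, "rb") as raw:
        encoding, _ = tokenize.detect_encoding(raw.readline)
    return open(path, "r", encoding=encoding)


def decode_chunks(binary_stream, encoding, chunk_size=4):
    """Approach 3: decode by hand, without ever seeking the stream."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            break
        # The incremental decoder copes with characters split across chunks.
        yield decoder.decode(chunk)
    yield decoder.decode(b"", final=True)


payload = "# -*- coding: utf-8 -*-\nname = 'caf\xe9'\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.py")
    with open(path, "wb") as f:
        f.write(payload)
    with open_twice(path) as stream:
        from_file = stream.read()

from_stream = "".join(decode_chunks(io.BytesIO(payload), "utf-8"))
```

Note that decode_chunks only yields raw text fragments: readline(), newline translation and the rest of the text-stream interface would still have to be rebuilt on top of it, which is the complaint in item 3.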
>
> I believe these problems could be addressed with a fairly
> simple and clean modification of the io.TextIOWrapper
> class...
>
> Proposal
> ========
> The following is a logical description; I don't mean to
> imply that the code must follow this outline exactly.
> It is based on looking at _pyio;  I hope the C code is
> equivalent.
>
> 1. Allow io.TextIOWrapper's encoding parameter to be a
>  callable object in addition to a string or None.
>
> 2. In __init__(), if the encoding parameter was callable,
>  record it as an encoding hook and leave encoding set to
>  None.
>
> 3. The places in io.TextIOWrapper that currently read
>  undecoded data from the internal buffer object and decode
>  it (only methods read() and _read_chunk(), I think) would
>  be modified to do so in this way:
>
> 4. Read data from the buffer object as is done now.
>
> 5. If the encoding has been set, get a decoder if necessary
>  and continue on as usual.
>
> 6. If the encoding is None, call the encoding callable
>  with the data just read and the buffer object.
>
> 7. The callable will examine the data, possibly using the
>  buffer object's peek method to look further ahead in the
>  file.  It returns the name of an encoding.
>
> 8. io.TextIOWrapper will get the encoding and record it,
>  and set up the decoder the same way as if the encoding name
>  had been received as a parameter, decode the read data and
>  continue on as usual.
>
> 9. In other non-read paths where encoding needs to be known,
>  raise an error if it is still None.
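
The steps above can be approximated with the current io module. This is only a sketch under assumptions: wrap_with_hook and the hook signature hook(data, buffer) are hypothetical, and the hook runs once before wrapping (on peeked bytes) rather than lazily on the first read as the proposal describes:

```python
import io
import os
import re

# Simplified stand-in for a PEP 263 style coding declaration.
COOKIE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def detect_encoding(data, buffer):
    """Hypothetical hook: return an encoding name for the sniffed bytes."""
    first_two_lines = b"\n".join(data.split(b"\n")[:2])
    match = COOKIE.search(first_two_lines)
    if match is None:
        raise LookupError("unable to determine encoding")
    return match.group(1).decode("ascii")


def wrap_with_hook(buffer, hook):
    """Choose the encoding via hook(data, buffer), then wrap without seeking."""
    data = buffer.peek()  # look ahead; the stream position is unchanged
    return io.TextIOWrapper(buffer, encoding=hook(data, buffer))


# A pipe is not seekable, so this exercises the non-rewindable case.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write("# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n".encode("latin-1"))
with wrap_with_hook(os.fdopen(read_fd, "rb"), detect_encoding) as stream:
    lines = stream.readlines()
```

peek() returns at most what the buffer holds after a single raw read, so this only helps when the encoding can be decided from the first buffer-full of data.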
>
> Were io.TextIOWrapper modified this way, it would offer:
>
> * Better performance since there is no need to reread data
>
> * Read data is decoded after being examined so the stream
>  is usable with serial datasources like pipes, sockets, etc.
>
> * User code is simplified and clearer; there is better
>  separation of concerns.  For example, the code in the
>  "Problem" section could be written:
>
>    stream = open(filename, encoding=detect_encoding)
>    ...
>    def detect_encoding(data, buffer):
>        # This is still basically the same function as
>        # in the code in the "Problem" section.
>        # ... look for Python coding declaration in
>        #     first two lines of the 'data' bytes object.
>        if not found_encoding:
>            raise LookupError("unable to determine encoding")
>        return found_encoding
>
> I have modified a copy of the _pyio module as described and
> the changes required seemed unsurprising and relatively
> few, though I am sure there are subtleties and other
> considerations I am missing.  Hence this post seeking
> feedback...