<p>Immediate thought: it seems like it would be easier to offer a way to inject data back into a buffered IO object's internal buffer. </p>
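<p>A rough sketch of what that could look like from user code today, without touching the io module itself. PushbackRaw is a hypothetical name, not an existing class: it replays bytes that were already consumed for sniffing and then falls through to the original stream, so the result can be wrapped in the usual BufferedReader and TextIOWrapper. It approximates "putting data back" rather than literally modifying a buffer's internals.</p>

```python
import io


class PushbackRaw(io.RawIOBase):
    """Hypothetical raw stream: replay `prefix`, then read from `raw`."""

    def __init__(self, prefix, raw):
        self._prefix = bytes(prefix)
        self._raw = raw

    def readable(self):
        return True

    def readinto(self, b):
        if self._prefix:
            # Hand back the bytes that were already consumed while sniffing.
            n = min(len(b), len(self._prefix))
            b[:n] = self._prefix[:n]
            self._prefix = self._prefix[n:]
            return n
        # Prefix exhausted: continue with the underlying stream.
        return self._raw.readinto(b)


# Usage: consume a few bytes to inspect them, then "put them back".
source = io.BytesIO("héllo\nworld\n".encode("utf-8"))
head = source.read(3)
text = io.TextIOWrapper(io.BufferedReader(PushbackRaw(head, source)),
                        encoding="utf-8")
result = text.read()
```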

<p>--<br>

Sent from my phone, thus the relative brevity :) </p>

<div class="gmail_quote">On Jun 12, 2012 12:43 AM, "Rurpy" &lt;<a href="mailto:rurpy@yahoo.com">rurpy@yahoo.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Here is another issue that came up in my ongoing<br>

adventure porting to Python3...<br>

<br>

Executive summary:<br>

==================<br>

<br>

There is no good way to read a text file when the<br>

encoding has to be determined by reading the start<br>

of the file.  A long-winded version of that follows.<br>

Scroll down to the "Proposal" section to skip it.<br>

<br>

Problem:<br>

========<br>

<br>

When one opens a text file for reading, one must specify<br>

(explicitly or by default) an encoding which Python will<br>

use to convert the raw bytes read into Python strings.<br>

This means one must know the encoding of a file before<br>

opening it, which is usually the case, but not always.<br>

<br>

Plain text files have no meta-data giving their encoding<br>

so sometimes it may not be known and some of the file must<br>

be read and a guess made.  Other data like html pages, xml<br>

files or python source code have encoding information inside<br>

them, but that too requires reading the start of the file<br>

without knowing the encoding in advance.<br>

<br>

I see three ways in general in Python 3 currently to attack<br>

this problem, but each has some severe drawbacks:<br>

<br>

1.  The most straightforward way to handle this is to open<br>

the file twice, first in binary mode or with latin1 encoding<br>

and again in text mode after the encoding has been determined.<br>

This of course has a performance cost since the data is read<br>

twice.  Further, it can't be used if the data comes<br>

from a pipe, socket or other non-rewindable source.  This<br>

includes sys.stdin when it comes from a pipe.<br>
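<br>
For illustration, a minimal sketch of this two-open approach, assuming the file is Python source so that the standard library's tokenize.detect_encoding() can do the sniffing. The name open_detected is made up for this example.<br>

```python
import tokenize


def open_detected(filename):
    """Open `filename` as text, opening it twice (hypothetical helper)."""
    # First open: binary mode, only to find out the encoding.
    with open(filename, "rb") as raw:
        encoding, _ = tokenize.detect_encoding(raw.readline)
    # Second open: text mode, now that the encoding is known.
    return open(filename, "r", encoding=encoding)
```

The first lines of the file are read once per open, and the filename must still be available for the second open, which is exactly what a pipe cannot offer.<br>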

<br>

2.  Alternatively, with a little more expertise, one can rewrap<br>

the open binary stream in a TextIOWrapper to avoid a second<br>

OS file open.  The standard library's tokenize.open()<br>

function does this:<br>

<br>

    def open(filename):<br>

        buffer = builtins.open(filename, 'rb')<br>

        encoding, lines = detect_encoding(buffer.readline)<br>

        buffer.seek(0)<br>

        text = TextIOWrapper(buffer, encoding, line_buffering=True)<br>

        text.mode = 'r'<br>

        return text<br>

<br>

This too seems to read the data twice and of course the<br>

seek(0) prevents this method also from being usable with<br>

pipes, sockets and other non-seekable sources.<br>
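<br>
The seek(0) limitation is easy to see with an OS pipe. This is only a small demonstration; can_rewind is a hypothetical helper name.<br>

```python
import io
import os


def can_rewind(stream):
    """Return True if stream.seek(0) succeeds (hypothetical helper)."""
    try:
        stream.seek(0)
    except (OSError, ValueError):
        # io.UnsupportedOperation derives from both OSError and ValueError.
        return False
    return True


# A pipe stands in for sys.stdin fed by another process.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"# coding: latin-1\n")
os.close(write_fd)
with os.fdopen(read_fd, "rb") as pipe_reader:
    pipe_ok = can_rewind(pipe_reader)

# An in-memory stream behaves like a regular file here.
file_ok = can_rewind(io.BytesIO(b"data"))
```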

<br>

3.  Another method is to simply leave the file open in<br>

binary mode, read bytes data, and manually decode it to<br>

text.  This seems to be the only option when reading from<br>

non-rewindable sources like pipes and sockets, etc.<br>

But then one loses all the advantages of having<br>

a text stream even though one wants to be reading text!<br>

And if one tries to hide this, one ends up reimplementing<br>

a good part of TextIOWrapper!<br>
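<br>
A sketch of what the manual route looks like, using an incremental decoder from the codecs module so that multi-byte characters split across chunks still decode correctly. The function name iter_text and the chunk size are arbitrary choices for this example.<br>

```python
import codecs
import io


def iter_text(binary_stream, encoding, chunk_size=4096):
    """Yield decoded text from a binary stream, chunk by chunk."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            break
        # The decoder keeps any incomplete trailing byte sequence
        # until the next chunk arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; this raises if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Note what is missing compared with a real text stream: readline(), iteration by line, newline translation and tell()/seek() would all have to be rebuilt on top of this.<br>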

<br>

I believe these problems could be addressed with a fairly<br>

simple and clean modification of the io.TextIOWrapper<br>

class...<br>

<br>

Proposal<br>

========<br>

The following is a logical description; I don't mean to<br>

imply that the code must follow this outline exactly.<br>

It is based on looking at _pyio;  I hope the C code is<br>

equivalent.<br>

<br>

1. Allow io.TextIOWrapper's encoding parameter to be a<br>

 callable object in addition to a string or None.<br>

<br>

2. In __init__(), if the encoding parameter was callable,<br>

 record it as an encoding hook and leave encoding set to<br>

 None.<br>

<br>

3. The places in io.TextIOWrapper that currently read<br>

 undecoded data from the internal buffer object and decode<br>

 it (only the read() and _read_chunk() methods, I think) would<br>

 be modified to do so in this way:<br>

<br>

4. Read data from the buffer object as is done now.<br>

<br>

5. If the encoding has been set, get a decoder if necessary<br>

 and continue on as usual.<br>

<br>

6. If the encoding is None, call the encoding callable<br>

 with the data just read and the buffer object.<br>

<br>

7. The callable will examine the data, possibly using the<br>

 buffer object's peek method to look further ahead in the<br>

 file.  It returns the name of an encoding.<br>

<br>

8. io.TextIOWrapper will get the encoding and record it,<br>

 and set up the decoder the same way as if the encoding name<br>

 had been received as a parameter, decode the read data and<br>

 continue on as usual.<br>

<br>

9. In other non-read paths where encoding needs to be known,<br>

 raise an error if it is still None.<br>
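<br>
The steps above can be sketched as a small stand-alone class. This is not the modified _pyio code and not an existing io API: HookedTextReader and sniff_bom are hypothetical names, only read() is implemented, and the hook shown merely looks for a byte order mark.<br>

```python
import codecs
import io


class HookedTextReader:
    """Hypothetical stand-in for the proposed TextIOWrapper behaviour."""

    def __init__(self, buffer, encoding, errors="strict", chunk_size=8192):
        self.buffer = buffer
        self._errors = errors
        self._chunk_size = chunk_size
        self._decoder = None
        if callable(encoding):
            # Step 2: remember the hook and leave the encoding unset.
            self._encoding_hook = encoding
            self.encoding = None
        else:
            self._encoding_hook = None
            self.encoding = encoding

    def _read_chunk(self):
        # Step 4: read undecoded bytes from the buffer object.
        data = self.buffer.read1(self._chunk_size)
        eof = not data
        if self.encoding is None:
            # Steps 6-8: ask the hook, then record its answer.
            self.encoding = self._encoding_hook(data, self.buffer)
        if self._decoder is None:
            # Step 5: get a decoder once the encoding is known.
            factory = codecs.getincrementaldecoder(self.encoding)
            self._decoder = factory(self._errors)
        return self._decoder.decode(data, final=eof), eof

    def read(self):
        parts = []
        eof = False
        while not eof:
            text, eof = self._read_chunk()
            parts.append(text)
        return "".join(parts)


def sniff_bom(data, buffer):
    """Example hook: choose an encoding from a byte order mark."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"
```

Because the first chunk is decoded after the hook has looked at it, nothing is read twice and no seek() is needed.<br>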

<br>

Were io.TextIOWrapper modified this way, it would offer:<br>

<br>

* Better performance since there is no need to reread data<br>

<br>

* Read data is decoded after being examined so the stream<br>

 is usable with serial data sources like pipes, sockets, etc.<br>

<br>

* User code is simplified and clearer; there is better<br>

 separation of concerns.  For example, the code in the<br>

 "Problem" section could be written:<br>

<br>

    stream = open(filename, encoding=detect_encoding)<br>

    ...<br>

    def detect_encoding (data, buffer):<br>

        # This is still basically the same function as<br>

        # in the code in the "Problem" section.<br>

        # ... look for a Python coding declaration in the<br>

        # first two lines of the 'data' bytes object ...<br>

        if not found_encoding:<br>

            raise ValueError("unable to determine encoding")<br>

        return found_encoding<br>
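<br>
To make the elided part concrete, here is one way such a hook might be written. It is a sketch only: it uses the coding-declaration pattern from PEP 263, ignores byte order marks, and does not use the buffer argument at all.<br>

```python
import re

# Coding declaration pattern from PEP 263, applied to bytes.
_COOKIE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_encoding(data, buffer=None):
    """Hypothetical hook: find a coding declaration in the first two lines."""
    for line in data.split(b"\n", 2)[:2]:
        match = _COOKIE.match(line)
        if match:
            return match.group(1).decode("ascii")
    raise ValueError("unable to determine encoding")
```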

<br>

I have modified a copy of the _pyio module as described and<br>

the changes required seemed unsurprising and relatively<br>

few, though I am sure there are subtleties and other<br>

considerations I am missing.  Hence this post seeking<br>

feedback...<br>

<br>


</blockquote></div>