[Python-ideas] TextIOWrapper callable encoding parameter

Mon Jun 11 16:42:46 CEST 2012

Here is another issue that came up in my ongoing
adventure porting to Python3...

Executive summary:
==================

There is no good way to read a text file when the 
encoding has to be determined by reading the start
of the file.  A long-winded version of that follows.
Scroll down to the "Proposal" section to skip it.

Problem:
========

When one opens a text file for reading, one must specify 
(explicitly or by default) an encoding which Python will
use to convert the raw bytes read into Python strings.  
This means one must know the encoding of a file before 
opening it, which is usually the case, but not always.

Plain text files have no meta-data giving their encoding 
so sometimes it may not be known and some of the file must 
be read and a guess made.  Other data like html pages, xml 
files or python source code have encoding information inside 
them, but that too requires reading the start of the file 
without knowing the encoding in advance.

I currently see three general ways to attack this problem
in Python 3, but each has some severe drawbacks:

1.  The most straightforward way to handle this is to open
the file twice, first in binary mode or with latin1 encoding
and again in text mode after the encoding has been determined.
This of course has a performance cost since the data is read
twice.  Further, it can't be used if the data source is a
pipe, socket or other non-rewindable source.  This
includes sys.stdin when it comes from a pipe.

2.  Alternatively, with a little more expertise, one can rewrap 
the open binary stream in a TextIOWrapper to avoid a second
OS file open.  The standard library's tokenize.open() 
function does this:

    def open(filename):
        buffer = builtins.open(filename, 'rb')
        encoding, lines = detect_encoding(buffer.readline)
        buffer.seek(0)
        text = TextIOWrapper(buffer, encoding, line_buffering=True)
        text.mode = 'r'
        return text

This too seems to read the data twice, and of course the
seek(0) also prevents this method from being used with
pipes, sockets and other non-seekable sources.

3.  Another method is to simply leave the file open in 
binary mode, read bytes data, and manually decode it to 
text.  This seems to be the only option when reading from 
non-rewindable sources like pipes and sockets, etc.
But then one loses all the advantages of having 
a text stream even though one wants to be reading text!
And if one tries to hide this, one ends up reimplementing
a good part of TextIOWrapper!
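
To make the third approach concrete, here is a minimal sketch of
manual decoding.  The helper name read_text_manually is made up for
illustration, and an io.BytesIO object stands in for a pipe or
socket; the only real library call relied on is
tokenize.detect_encoding().

```python
import io
import tokenize


def read_text_manually(binary_stream):
    """Approach 3: read raw bytes, detect the encoding, decode by hand."""
    raw = binary_stream.read()
    # detect_encoding() wants a readline callable, so feed it from an
    # in-memory copy; the real stream is never rewound.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return raw.decode(encoding)


# Stand-in for a non-seekable source such as a pipe (assumption).
source = io.BytesIO(b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n")
text = read_text_manually(source)
```

The result is one big string rather than a text stream, so line
iteration, newline translation and incremental reading all have to
be redone by hand.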

I believe these problems could be addressed with a fairly 
simple and clean modification of the io.TextIOWrapper
class... 

Proposal
========
The following is a logical description; I don't mean to 
imply that the code must follow this outline exactly.
It is based on looking at _pyio;  I hope the C code is
equivalent.

1. Allow io.TextIOWrapper's encoding parameter to be a
 callable object in addition to a string or None.

2. In __init__(), if the encoding parameter was callable, 
 record it as an encoding hook and leave encoding set to
 None.

3. The places in io.TextIOWrapper that currently read
 undecoded data from the internal buffer object and decode
 it (only methods read() and _read_chunk() I think) would
 be modified to do so in this way:

4. Read data from the buffer object as is done now.

5. If the encoding has been set, get a decoder if necessary
 and continue on as usual.

6. If the encoding is None, call the encoding callable
 with the data just read and the buffer object.

7. The callable will examine the data, possibly using the
 buffer object's peek method to look further ahead in the
 file.  It returns the name of an encoding.

8. io.TextIOWrapper will get the encoding and record it,
 and set up the decoder the same way as if the encoding name
 had been received as a parameter, decode the read data and
 continue on as usual.

9. In other non-read paths where encoding needs to be known,
 raise an error if it is still None.
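
The read path in steps 1 to 8 can be sketched roughly as follows.
This is not the modified _pyio code; HookedTextReader and
sniff_encoding are hypothetical names, and the class is a toy
stand-in that only supports read() to the end of the stream.

```python
import codecs
import io


class HookedTextReader:
    """Toy stand-in for the proposed behaviour (read path only)."""

    def __init__(self, buffer, encoding, chunk_size=8192):
        self.buffer = buffer
        # Steps 1-2: a callable is kept as a hook, encoding stays None.
        self._hook = encoding if callable(encoding) else None
        self.encoding = None if self._hook else encoding
        self._decoder = None
        self._chunk_size = chunk_size

    def read(self):
        """Read and decode everything remaining in the stream."""
        parts = []
        eof = False
        while not eof:
            data = self.buffer.read1(self._chunk_size)      # step 4
            eof = not data
            if self.encoding is None:                       # steps 6-8
                self.encoding = self._hook(data, self.buffer)
            if self._decoder is None:                       # step 5
                self._decoder = codecs.getincrementaldecoder(self.encoding)()
            parts.append(self._decoder.decode(data, final=eof))
        return "".join(parts)


def sniff_encoding(data, buffer):
    # Hypothetical hook: UTF-16 if a BOM is present, UTF-8 otherwise.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


raw = io.BufferedReader(io.BytesIO("h\u00e9llo\n".encode("utf-16")))
reader = HookedTextReader(raw, sniff_encoding)
text = reader.read()
```

Each byte is read from the buffer exactly once, and the hook sees
the first chunk before any decoding happens, so nothing has to be
rewound.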

Were io.TextIOWrapper modified this way, it would offer:

* Better performance since there is no need to reread data

* Data is decoded only after it has been examined, so the stream
 is usable with serial data sources like pipes, sockets, etc.

* User code is simplified and clearer; there is better
 separation of concerns.  For example, the code in the 
 "Problem" section could be written:

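    import re

    # Only valid with the proposed change, not in current Python.
    stream = open(filename, encoding=detect_encoding)
    ...
    def detect_encoding(data, buffer):
        # This is still basically the same function as
        # in the code in the "Problem" section.
        # Look for a Python coding declaration in the
        # first two lines of the 'data' bytes object.
        found_encoding = None
        for line in data.splitlines()[:2]:
            match = re.search(rb"coding[:=]\s*([-\w.]+)", line)
            if match:
                found_encoding = match.group(1).decode("ascii")
                break
        if not found_encoding:
            raise ValueError("unable to determine encoding")
        return found_encoding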
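
Step 7 says the callable may use the buffer object's peek method to
look further ahead.  As a rough illustration for the xml case
mentioned in the "Problem" section, a hook might look like the
sketch below.  The name xml_encoding_hook, the regular expression
and the UTF-8 fallback are my own assumptions, not part of the
proposal.

```python
import io
import re

# Simplified pattern for the encoding named in an XML declaration.
XML_DECL = re.compile(rb'''<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']''')


def xml_encoding_hook(data, buffer):
    """Hypothetical hook: return the encoding from an XML declaration."""
    # peek() returns buffered bytes without advancing the position,
    # so the bytes it shows are still there for later reads.
    window = data + buffer.peek(256)
    match = XML_DECL.search(window)
    return match.group(1).decode("ascii") if match else "utf-8"


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<a>\xe9</a>'
buffer = io.BufferedReader(io.BytesIO(doc))
first = buffer.read(10)            # pretend this is the chunk just read
encoding = xml_encoding_hook(first, buffer)
rest = buffer.read()               # peek() consumed nothing
```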

I have modified a copy of the _pyio module as described and 
the changes required seemed unsurprising and relatively
few, though I am sure there are subtleties and other
considerations I am missing.  Hence this post seeking
feedback...