[Python-ideas] TextIOWrapper callable encoding parameter

Tue Jun 12 23:48:08 CEST 2012

> 1.  The most straight-forward way to handle this is to open
> the file twice, first in binary mode or with latin1 encoding
> and again in text mode after the encoding has been determined.
> This of course has a performance cost since the data is read
> twice.  Further, it can't be used if the data source is a
> pipe, socket or other non-rewindable source.  This
> includes sys.stdin when it comes from a pipe.

Some months ago, I proposed automatically detecting whether a file
contains a BOM and using it to set the encoding. Various methods were
proposed, but there was no real consensus. One proposal was to use a
codec (e.g. "bom") which uses the BOM if it is present, and so doesn't
need to read the file twice.
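
To make the BOM idea concrete, here is a minimal sketch of the
detection step only. It is a plain function, not the "bom" codec
discussed back then; the name encoding_from_bom and the "utf-8"
fallback are assumptions made for illustration.

```python
import codecs

# Hypothetical lookup table. Longer BOMs come first because the
# UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def encoding_from_bom(head, default="utf-8"):
    """Return a codec name chosen from the BOM at the start of head.

    head is the first few bytes of the stream. The returned codecs
    ("utf-8-sig", "utf-16", "utf-32") consume the BOM when decoding.
    The default is an assumption; a real caller would pick its own.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

Four bytes of look-ahead are enough for this check, which is why the
file does not have to be read twice.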

For the pipe issue: it depends on where the encoding specification is.
If the encoding is written at the end of your "file" (stream), you have
to store the whole stream content (a few MB, or maybe much more?) in
memory. If it is in the first lines, you have to store these lines in
a buffer. It's not easy to decide on a threshold.
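
For the "first lines" case, buffering the head of a non-rewindable
stream could look roughly like the sketch below. The names
decode_stream, detect and threshold are hypothetical, and the sketch
assumes the encoding can be decided from the first threshold bytes.

```python
import codecs


def decode_stream(binary, detect, threshold=4096):
    """Yield decoded text from binary without ever seeking.

    binary: a binary stream (it may be a pipe or a socket file).
    detect: hypothetical callable taking the buffered head (bytes)
            and returning a codec name.
    threshold: how many bytes to buffer before deciding (assumption).
    """
    # On an unbuffered stream, read() may return fewer bytes than
    # requested; a buffered reader fills the request when it can.
    head = binary.read(threshold)
    decoder = codecs.getincrementaldecoder(detect(head))()
    # The buffered head is decoded first, so nothing is read twice.
    yield decoder.decode(head)
    while True:
        chunk = binary.read(threshold)
        if not chunk:
            break
        yield decoder.decode(chunk)
    # Flush the decoder; this raises if the stream ends mid-character.
    yield decoder.decode(b"", final=True)
```

The incremental decoder keeps multi-byte characters intact when they
are split across two chunks. If the encoding specification is at the
end of the stream, this approach does not help: the whole content has
to be kept in memory.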

I don't like the codec approach because the codec is disconnected from
the stream. For example, the codec doesn't know the current position
in the stream, nor can it read a few more bytes forward or backward. If
you open the file in "append" mode, you are not writing at the
beginning but at the end of the file. You may also seek to an arbitrary
position before the first read...

There are also some special cases. For example, when a text file is
opened in write mode, the file is seekable and the file position is
not zero, TextIOWrapper calls encoder.setstate(0) so that the BOM is
not written in the middle of the file. (See also Lib/test/test_io.py
for related tests.)
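
A small sketch of that special case, using io.BytesIO in place of a
real file. The helper name append_text is made up for illustration,
and "utf-8-sig" is chosen only because its BOM does not depend on the
platform byte order.

```python
import io


def append_text(existing, text, encoding="utf-8-sig"):
    """Append text to existing bytes through a TextIOWrapper.

    Hypothetical helper: the in-memory stream stands in for a file
    opened in append mode.
    """
    buf = io.BytesIO(existing)
    buf.seek(0, io.SEEK_END)  # position is non-zero if data exists
    wrapper = io.TextIOWrapper(buf, encoding=encoding)
    wrapper.write(text)
    wrapper.flush()
    wrapper.detach()  # keep buf open after the wrapper goes away
    return buf.getvalue()
```

With empty existing data the position is zero and the BOM is written
before the text. With existing data the position is non-zero, so the
text should be appended without a second BOM.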

> 2.  Alternatively, with a little more expertise, one can rewrap
> the open binary stream in a TextIOWrapper to avoid a second
> OS file open.

That's my favorite method because you have full control over the
stream. (I wrote tokenize.open.) But yes, it does not work on
non-seekable streams (e.g. pipes).
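
For Python source files, the rewrap approach can be sketched as
follows. The function name open_source_text is hypothetical;
tokenize.detect_encoding is the standard library function that reads
at most two lines to find a BOM or a coding cookie.

```python
import io
import tokenize


def open_source_text(binary):
    """Wrap a seekable binary stream of Python source as text.

    binary must be positioned at the start and support seek(0),
    so this does not work on pipes or sockets.
    """
    # detect_encoding() calls readline at most twice.
    encoding, _lines = tokenize.detect_encoding(binary.readline)
    binary.seek(0)  # rewind: this is the step a pipe cannot do
    return io.TextIOWrapper(binary, encoding=encoding,
                            line_buffering=True)
```

Only the first one or two lines are read before the rewind, so the
cost of "reading twice" is small here.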

> This too seems to read the data twice and of course the
> seek(0) prevents this method also from being usable with
> pipes, sockets and other non-seekable sources.

Does it really matter? You usually need to read only a few bytes to get
the encoding.

> 9. In other non-read paths where encoding needs to be known,
>  raise an error if it is still None.

Why not read data until the encoding is known instead?

> I have modified a copy of the _pyio module as described and
> the changes required seemed unsurprising and relatively
> few, though I am sure there are subtleties and other
> considerations I am missing.  Hence this post seeking
> feedback...

Can you post the modified module somewhere so I can play with it?

Victor