On 06/11/2012 10:24 AM, Stephen J. Turnbull wrote:
Nick Coghlan writes:
Immediate thought: it seems like it would be easier to offer a way to
inject data back into a buffered IO object's internal buffer.
ungetch()?
What would be the TextIOWrapper api for that?
If you're only interested in the top of the file (see below), I would
suggest allowing only one bufferful, and then simply rewinding the
buffer pointer once you're done. This is one strategy used by Emacsen
for encoding detection (for the reason pointed out by Rurpy: not all
streams are rewindable).
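
A rough sketch of that one-bufferful idea using calls that already
exist in the io module: BufferedReader.peek() returns bytes from the
internal buffer without advancing the position, so nothing has to be
rewound and the raw stream need not be seekable. The names
sniff_encoding() and open_sniffed() are made up for illustration, and
the detector only checks for a BOM (it does not tell UTF-32 from
UTF-16); a real one would be application specific.

```python
import codecs
import io


def sniff_encoding(head, default="utf-8"):
    """Hypothetical detector: look for a byte order mark only."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # this codec consumes the BOM itself
    return default


def open_sniffed(raw, default="utf-8"):
    """Wrap a binary stream in a text layer whose encoding is sniffed."""
    buffered = io.BufferedReader(raw)
    # peek() may return more or fewer bytes than requested, but it never
    # moves the stream position, so the text layer still sees every byte.
    head = buffered.peek(4)
    return io.TextIOWrapper(buffered, encoding=sniff_encoding(head, default))


data = codecs.BOM_UTF8 + "h\u00e9llo\n".encode("utf-8")
text = open_sniffed(io.BytesIO(data)).read()
```

This only works when the detector needs no more than what a single
peek() returns, which is the same restriction as "only one bufferful".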
But is that really "easier"? It might be more general, but you still
need to reinitialize the encoding (i.e., from the trivial "binary" to
whatever is detected), with all the hair that comes with that.
I don't think there is any hair involved. In at least
the _pyio version of TextIOWrapper, initializing the
encoding (in the read path) consists of calling
self._get_decoder(). One needs to move the few places
where that is called now to nearby places that are
after the raw buffer has been read but before it is
decoded. Some thought may be needed about whether to
keep raising errors at the old locations when the
callable encoding hook is not being used (to maintain
complete backwards compatibility; I am not sure that
is necessary), but I wouldn't call that hairy.
Of course there may be other factors I am missing...
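
To make the reordering concrete, here is a minimal sketch. It is not
the _pyio code: LazyDecodingReader is a made-up stand-in for the read
path, and passing the first chunk to _get_decoder() and on to a
callable encoding hook is my assumption about how the hook could be
wired in. The only point is that the decoder is created after the
first raw read and before the first decode.

```python
import codecs
import io


class LazyDecodingReader:
    """Hypothetical, stripped-down stand-in for TextIOWrapper's read path."""

    def __init__(self, buffer, encoding, chunk_size=8192):
        self.buffer = buffer
        self._encoding = encoding  # a codec name, or callable(bytes) -> name
        self._chunk_size = chunk_size
        self._decoder = None       # deliberately not created here
        self.encoding = None

    def _get_decoder(self, first_chunk):
        enc = self._encoding
        if callable(enc):
            enc = enc(first_chunk)  # the hook sees the raw bytes first
        self.encoding = enc
        self._decoder = codecs.getincrementaldecoder(enc)()
        return self._decoder

    def read(self):
        parts = []
        while True:
            chunk = self.buffer.read(self._chunk_size)
            # Moved call site: after the raw read, before the decode.
            if self._decoder is None:
                self._get_decoder(chunk)
            parts.append(self._decoder.decode(chunk, final=not chunk))
            if not chunk:
                break
        return "".join(parts)


def hook(head):
    """Toy hook: pick the codec from a marker on the first line."""
    return "latin-1" if head.startswith(b"#latin") else "utf-8"


reader = LazyDecodingReader(io.BytesIO(b"#latin\n\xe9"), hook)
result = reader.read()
```

With a plain string for encoding the behaviour is the same as before,
except that a bad codec name is reported at the first read rather than
at construction time, which is the compatibility question above.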
Executive summary:
==================
There is no good way to read a text file when the
encoding has to be determined by reading the start
of the file. A long-winded version of that follows.
Scroll down to the "Proposal" section to skip it.
This may be insufficiently general. Specifically, both Emacsen and vi
allow specification of editor configuration variables at the bottom of
the file as well as the top. I don't know whether vi allows encoding
specs at the bottom, but Emacsen do (but only for files).
I wouldn't recommend paying much attention to what Emacsen actually
*do* when initializing a stream (it's, uh, "baroque").
Looking only at the beginning of an input stream is
general enough for a large class of problems including
tokenizing Python source code.
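
For the Python source case the stdlib already does exactly this:
tokenize.detect_encoding() reads at most two lines from a binary
stream, looks for a BOM or a coding cookie, and returns the encoding
together with the raw lines it consumed. The sample source bytes below
are made up for illustration. Note that the caller has to glue the
consumed lines back on by hand, which is the awkwardness being
discussed.

```python
import codecs
import io
import tokenize

# Made-up sample: a coding cookie on line 1, a Latin-1 byte on line 2.
source = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
stream = io.BytesIO(source)

# detect_encoding() takes a readline callable over a *binary* stream and
# returns (encoding_name, list_of_raw_lines_it_had_to_read).
encoding, consumed = tokenize.detect_encoding(stream.readline)

# Nothing is pushed back into the stream, so the already-read lines must
# be decoded separately and joined with the rest of the file.
text = b"".join(consumed).decode(encoding) + stream.read().decode(encoding)
```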