[Python-ideas] TextIOWrapper callable encoding parameter
rurpy at yahoo.com
Wed Jun 13 17:46:01 CEST 2012
On 06/11/2012 10:24 AM, Stephen J. Turnbull wrote:
> > Nick Coghlan writes:
> > > Immediate thought: it seems like it would be easier to offer a way to
> > > inject data back into a buffered IO object's internal buffer.
> > ungetch()?
What would be the TextIOWrapper API for that?
> > If you're only interested in the top of the file (see below), I would
> > suggest allowing only one bufferful, and then simply rewinding the
> > buffer pointer once you're done. This is one strategy used by Emacsen
> > for encoding detection (for the reason pointed out by Rurpy: not all
> > streams are rewindable).
> > But is that really "easier"? It might be more general, but you still
> > need to reinitialize the encoding (ie, from the trivial "binary" to
> > whatever is detected), with all the hair that comes with that.
I don't think there is any hair involved. In the _pyio
version of TextIOWrapper at least, initializing the
encoding (in the read path) consists of calling
self._get_decoder(). The few places where that is called
now would need to move to nearby places that come after
the raw buffer has been read but before it is decoded.
Some thought may be needed about still raising errors at
the old locations when the callable encoding hook is not
being used (to maintain complete backwards compatibility;
I am not sure that is necessary), but I wouldn't call
that hairy.
Of course there may be other factors I am missing...
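To make the idea concrete, here is a minimal sketch of deferring
the decoder choice until the first raw chunk has been read. It is
not the actual _pyio code: the class DeferredDecoderReader, its
method _init_decoder() and the hook sniff_bom() are names made up
for this illustration, and the latin-1 fallback is an arbitrary
assumption.

```python
import codecs
import io


def sniff_bom(head):
    """Hypothetical encoding hook: choose a codec from the first raw bytes."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "latin-1"  # arbitrary fallback chosen for this sketch


class DeferredDecoderReader:
    """Hypothetical reader that creates its decoder after the first raw read."""

    def __init__(self, raw, encoding, chunk_size=8192):
        self._raw = raw
        self._encoding = encoding  # a codec name, or a callable taking bytes
        self._chunk_size = chunk_size
        self._decoder = None

    def _init_decoder(self, first_chunk):
        # The encoding is resolved here, with the raw bytes already in hand.
        if callable(self._encoding):
            name = self._encoding(first_chunk)
        else:
            name = self._encoding
        self._decoder = codecs.getincrementaldecoder(name)(errors="strict")

    def read_chunk(self):
        chunk = self._raw.read(self._chunk_size)
        if self._decoder is None:
            # After the raw read, before any decoding has happened.
            self._init_decoder(chunk)
        # An empty chunk means end of stream, so flush the decoder.
        return self._decoder.decode(chunk, final=not chunk)


reader = DeferredDecoderReader(io.BytesIO("héllo".encode("utf-16")), sniff_bom)
text = reader.read_chunk()
```

Nothing is rewound or pushed back into the buffer here, so the
same approach would work on a stream that is not seekable.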
> > > > Executive summary:
> > > > ==================
> > > >
> > > > There is no good way to read a text file when the
> > > > encoding has to be determined by reading the start
> > > > of the file. A long-winded version of that follows.
> > > > Scroll down to the "Proposal" section to skip it.
> > This may be insufficiently general. Specifically, both Emacsen and vi
> > allow specification of editor configuration variables at the bottom of
> > the file as well as the top. I don't know whether vi allows encoding
> > specs at the bottom, but Emacsen do (but only for files).
> > I wouldn't recommend paying much attention to what Emacsen actually
> > *do* when initializing a stream (it's, uh, "baroque").
Looking only at the beginning of an input stream is
general enough for a large class of problems including
tokenizing Python source code.
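For the Python source case, the standard library already does
this kind of sniffing: tokenize.detect_encoding() reads at most
two lines through a readline callable and returns the encoding
together with the raw lines it consumed. A small sketch, where
the sample source bytes are made up for the example:

```python
import io
import tokenize

# Made-up sample: a source file declaring latin-1 in a coding cookie.
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
stream = io.BytesIO(source)

# detect_encoding() only looks at the start of the stream.
encoding, consumed = tokenize.detect_encoding(stream.readline)

# The consumed lines are handed back, so no rewind is needed:
# rejoin them with the rest of the stream and decode everything.
text = (b"".join(consumed) + stream.read()).decode(encoding)
```

Because the lines already read are returned to the caller, this
works without seeking, but the caller has to stitch the bytes
back together before decoding.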