[Python-3000] encoding hell
talin at acm.org
Mon Sep 4 01:04:34 CEST 2006
Anders J. Munch wrote:
> Watch out! There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account. For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents. For read/write files, it's a whole different
> ballgame, because stream position and data are shared.
> That means you cannot use the same buffering code for both cases. For
> files, whenever you write something, you need to take into account
> that that may overlap your read buffer or change read position. You
> should take another look at layer.BufferingLayer with that in mind.
> regards, Anders
This is a better explanation of some of the comments I was raising
earlier: The choice of buffering strategy depends on a number of factors
related to how the stream is going to be used, as well as the internal
implementation of the stream. A buffering strategy that works well for a
socket won't work very well for a DBMS.
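
To make the shared-position point concrete, here is a minimal sketch using only the standard library in present-day Python. The function name shared_position_demo is made up for illustration, and the file is opened unbuffered so that the positions are easy to follow. It shows that a write on a read/write file lands wherever the previous read stopped, which is exactly why a read buffer filled before the write can no longer be trusted.

```python
import os
import tempfile


def shared_position_demo():
    """Hypothetical demo: reads and writes on one file share one position."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.close(fd)
        # buffering=0 gives a raw, unbuffered file object.
        with open(path, "r+b", buffering=0) as f:
            first = f.read(4)        # reads b"0123"; position is now 4
            f.write(b"XY")           # overwrites bytes 4-5; position is now 6
            after_write = f.tell()
            rest = f.read()          # continues from position 6
            f.seek(0)
            whole = f.read()         # the file as it now stands on disk
        return first, after_write, rest, whole
    finally:
        os.remove(path)
```

Had a buffering layer read ahead all ten bytes on the first read, its cached copy of bytes 4-5 would be stale after the write. A TCP socket has no such coupling, since incoming and outgoing data never overlap.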
When I stated earlier that 'the OS can do a better job of buffering than
we can', what I meant to say was somewhat broader than that - which is
that each layer is, in many cases, a better judge of what *kind* of
buffering it needs than the person assembling the layers.
This doesn't mean that each layer has to implement its own buffering
algorithm. The common buffering algorithms can be factored out into
their own objects -- but what I'd suggest is that the choice of buffer
algorithm not *normally* be exposed to the person constructing the io stack.
Thus, when creating a standard "line reader", rather than having the user write:
    fh = TextReader( Buffer( File( ... ) ) )
Instead, let the TextReader choose the kind of buffer it wants and
supply that part automatically. There are several reasons why I think
this would work better:
1) You can't simply stick just any buffer object in the middle there and
expect it to work. Different buffer strategies have different
interfaces, and trying to meld them all into one uber-interface would
make for a very complex interface.
2) The TextReader knows perfectly well what kind of buffer it needs.
Depending on how TextReader is implemented, it might want a serial,
read-only buffer that allows a limited degree of look-ahead buffering so
that it can find the line breaks. Or it might want a pair of buffers -
one decoded, one encoded. There's no way that the user can know what
kind of buffer to use without knowing the implementation details of
TextReader.
3) TextReader can be optimized even more if it is allowed to 'peek'
inside the internals of the buffer - something that would not be allowed
if it had to conform to calling the buffer through a standard interface.
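
The three points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed design: the class _LookaheadBuffer, its fill() method, and the TextReader constructor signature are all hypothetical, and the line splitting assumes UTF-8 or another ASCII-compatible encoding so that a newline byte never falls inside a multi-byte character.

```python
import io


class _LookaheadBuffer:
    """Hypothetical serial, read-only buffer with limited look-ahead."""

    def __init__(self, raw, chunk_size=4096):
        self._raw = raw
        self._chunk_size = chunk_size
        self._data = b""

    def fill(self):
        """Pull one more chunk from the raw stream; return False at EOF."""
        chunk = self._raw.read(self._chunk_size)
        if not chunk:
            return False
        self._data += chunk
        return True


class TextReader:
    """Hypothetical line reader that supplies its own buffer."""

    def __init__(self, raw, encoding="utf-8"):
        # The reader, not the caller, decides which buffer it needs.
        self._buf = _LookaheadBuffer(raw)
        self._encoding = encoding

    def readline(self):
        buf = self._buf
        while True:
            # Peek directly at the buffer's internals to find the line break.
            idx = buf._data.find(b"\n")
            if idx >= 0:
                line, buf._data = buf._data[:idx + 1], buf._data[idx + 1:]
                return line.decode(self._encoding)
            if not buf.fill():
                # EOF: return whatever is left (possibly the empty string).
                line, buf._data = buf._data, b""
                return line.decode(self._encoding)
```

With this shape the caller writes only fh = TextReader(io.BytesIO(b"one\ntwo\n")) and never sees the buffer at all.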
More generally, the choice of buffer depends on the usage pattern for
reading / writing to the file - and that usage pattern is embodied in
the definition of "TextReader". By creating a "TextReader" object, the
user is stating their intention to read the file a certain way, in a
certain order, with certain performance characteristics. The choice of
buffering derives directly from those usage patterns. So the two go hand
in hand.
Now, I'm not saying that you can't stick additional layers in-between
TextReader and FileStream if you want to. An example might be the
"resync" layer that you mentioned, or a journaling layer that ensures
that all writes are recoverable. I'm merely saying that for the specific
issue of buffering, I think that the choice of buffer type is
complicated, and requires knowledge that might not be accessible to the
person assembling the stack.
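
As a rough sketch of what such an extra layer might look like, here is a pass-through journaling layer placed between a text layer and the underlying stream. Every name here (JournalingLayer, TextWriter, the journal attribute) is hypothetical, and the in-memory list merely stands in for a real recovery log.

```python
import io


class JournalingLayer:
    """Hypothetical pass-through layer: records each write, then forwards it."""

    def __init__(self, lower):
        self._lower = lower
        self.journal = []      # in-memory stand-in for a real recovery log

    def write(self, data):
        self.journal.append(bytes(data))
        return self._lower.write(data)

    def read(self, size=-1):
        return self._lower.read(size)


class TextWriter:
    """Hypothetical top layer; it only needs write() from the layer below."""

    def __init__(self, lower, encoding="utf-8"):
        self._lower = lower
        self._encoding = encoding

    def write(self, text):
        return self._lower.write(text.encode(self._encoding))


raw = io.BytesIO()
layer = JournalingLayer(raw)
out = TextWriter(layer)
out.write("hello\n")
```

The journaling layer is visible to the person assembling the stack because it changes what the stack does, whereas any buffering would remain an internal decision of the text layer.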