[Python-3000] Comment on iostack library
talin at acm.org
Wed Aug 30 07:26:59 CEST 2006
Guido van Rossum wrote:
> On 8/29/06, Talin <talin at acm.org> wrote:
>> I've been thinking more about the iostack proposal. Right now, a typical
>> file handle consists of 3 "layers" - one representing the backing store
>> (file, memory, network, etc.), one for adding buffering, and one
>> representing the program-level API for reading strings, bytes, decoded
>> text, etc.
>> I wonder if it wouldn't be better to cut that down to two. Specifically,
>> I would like to suggest eliminating the buffering layer.
>> My reasoning is fairly straightforward: Most file system handles,
>> network handles and other operating system handles already support
>> buffering, and they do a far better job of it than we can. The handles
>> that don't support buffering are memory streams - which don't need
>> buffering anyway.
>> Of course, it would make sense for Python to provide its own buffering
>> implementation if we were going to always use the lowest-level i/o API
>> provided by the operating system, but I can't see why we would want to
>> do that. The OS knows how to allocate an optimal buffer, using
>> information such as the block size of the filesystem, whereas trying to
>> achieve this same level of functionality in the Python standard library
>> would be needlessly complex IMHO.
> I'm not sure I follow.
> We *definitely* don't want to use stdio -- it's not part of the OS
> anyway, and it has some annoying quirks, like not giving you any
> insight into how it is using the buffer, not letting you change the
> buffer size on the fly, and crashing when you switch between read and
> write calls.
> So given that, how would you implement readline()? Reading one byte at
> a time until you've got the \n is definitely way too slow given the
> constant overhead of system calls.
> Regarding optimal buffer size, I've never seen a program for which 8K
> wasn't optimal. Larger buffers simply don't pay off.
Well, as far as readline goes: in order to split the text into lines,
you have to decode the text first anyway, which is a layer-3 operation.
You can't just read bytes until you hit a \n, because the file you are
reading might be encoded in UCS-2 or something similar. For example, in
big-endian UCS-2 a newline is encoded as 0x00 0x0A, whereas in
little-endian UCS-2 it is 0x0A 0x00. Merely stopping at the 0x0A byte is
incorrect: you've only read half of the character.
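To make that concrete, here is a minimal sketch (using Python's utf-16
codecs as a stand-in for UCS-2) showing that a byte-level "read until
0x0A" stops in the middle of the newline character:

```python
# Naive "read until 0x0A" on UTF-16-LE data stops mid-character.
data = "ab\ncd".encode("utf-16-le")   # b'a\x00b\x00\n\x00c\x00d\x00'

# A byte-oriented readline would return everything up to and
# including the first 0x0A byte:
naive_line = data[:data.index(b"\n") + 1]   # b'a\x00b\x00\n'

# The trailing 0x00 of the newline is missing, so the "line" is not
# even a valid UTF-16-LE byte sequence:
try:
    naive_line.decode("utf-16-le")
except UnicodeDecodeError:
    print("naive split produced half a character")
```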
You're correct that reading by line does require a buffer if you want to
do it efficiently. However, in a world of character encodings, the
readline buffer has to live at a higher level in the IO stack -- the
level that understands text encodings. There may be a different set of
buffers at the lower level to minimize the number of disk i/o
operations, but they can't really be the same buffer -- either that, or
the text-encoding layer will need fairly incestuous knowledge of what's
going on at the lower layers so that it can peek inside their buffers.
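A rough sketch of what such a higher-level readline might look like --
my own illustration, not part of the iostack proposal; the function name
and signature are invented -- using an incremental decoder so that line
splitting happens on decoded text rather than raw bytes:

```python
import codecs
import io

def readline_decoded(raw, encoding, chunk_size=8192):
    """Read one line from binary stream `raw`, splitting on decoded
    text rather than raw bytes. Illustrative only."""
    decoder = codecs.getincrementaldecoder(encoding)()
    chars = []
    while True:
        chunk = raw.read(chunk_size)
        text = decoder.decode(chunk, final=not chunk)
        nl = text.find("\n")
        if nl >= 0:
            chars.append(text[:nl + 1])
            # A real implementation would push text[nl + 1:] back
            # into a text-level buffer for the next call.
            return "".join(chars)
        chars.append(text)
        if not chunk:          # EOF without a newline
            return "".join(chars)

stream = io.BytesIO("ab\ncd".encode("utf-16-le"))
print(readline_decoded(stream, "utf-16-le"))   # prints 'ab' and a newline
```

The point of the sketch is that the \n search happens after decoding, so
it works identically for 8-bit, UCS-2, or any other encoding the codec
layer understands.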
It seems to me that no matter how you slice it, you can't have an
abstract "buffering" layer that is independent of both the layer beneath
and the layer above. Both the text decoding layer and the disk i/o layer
need to have fairly intimate knowledge of their buffers if you want
maximum efficiency. (I'm not opposed to a custom implementation of
buffering in the level 1 file object itself, although I suspect in most
cases you'd be better off using what the OS or its standard libs provide.)
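To illustrate why a lower-level buffer pays for itself -- a sketch using
today's Python io module purely as a stand-in for an OS handle, with an
invented CountingRaw class to count low-level read calls:

```python
import io

class CountingRaw(io.RawIOBase):
    """Wraps a bytes payload and counts low-level read calls,
    standing in for an unbuffered OS file handle. Illustrative only."""
    def __init__(self, payload):
        self._buf = io.BytesIO(payload)
        self.reads = 0
    def readable(self):
        return True
    def readinto(self, b):
        self.reads += 1
        data = self._buf.read(len(b))
        b[:len(data)] = data
        return len(data)

payload = b"x" * 100_000

# One byte at a time straight from the "OS": one call per byte.
raw = CountingRaw(payload)
while raw.read(1):
    pass
unbuffered_reads = raw.reads

# The same traffic through an 8K buffer: roughly a dozen calls
# instead of roughly 100,000.
raw = CountingRaw(payload)
buffered = io.BufferedReader(raw, buffer_size=8192)
while buffered.read(1):
    pass
buffered_reads = raw.reads

print(unbuffered_reads, buffered_reads)
```

Whether that 8K buffer lives in the OS, in stdio, or in the level-1 file
object is exactly the question at issue; the sketch only shows the cost
it is there to avoid.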
As far as stdio not giving you hints as to how it is using the buffer, I
am not sure what you mean -- what kind of information would a custom
buffer implementation give you that stdio would not? If early detection
of \n is what you are thinking of, I've already shown that that won't
work unless you assume an 8-bit encoding.