[Python-3000] Comment on iostack library

Wed Aug 30 07:26:59 CEST 2006

Guido van Rossum wrote:
> On 8/29/06, Talin <talin at acm.org> wrote:
>> I've been thinking more about the iostack proposal. Right now, a typical
>> file handle consists of 3 "layers" - one representing the backing store
>> (file, memory, network, etc.), one for adding buffering, and one
>> representing the program-level API for reading strings, bytes, decoded
>> text, etc.
>>
>> I wonder if it wouldn't be better to cut that down to two. Specifically,
>> I would like to suggest eliminating the buffering layer.
>>
>> My reasoning is fairly straightforward: Most file system handles,
>> network handles and other operating system handles already support
>> buffering, and they do a far better job of it than we can. The handles
>> that don't support buffering are memory streams - which don't need
>> buffering anyway.
>>
>> Of course, it would make sense for Python to provide its own buffering
>> implementation if we were going to always use the lowest-level i/o API
>> provided by the operating system, but I can't see why we would want to
>> do that. The OS knows how to allocate an optimal buffer, using
>> information such as the block size of the filesystem, whereas trying to
>> achieve this same level of functionality in the Python standard library
>> would be needlessly complex IMHO.
> 
> I'm not sure I follow.
> 
> We *definitely* don't want to use stdio -- it's not part of the OS
> anyway, and has some annoying quirks like not giving you any insight
> in how it is using the buffer, nor changing the buffer size on the
> fly, and crashing when you switch read and write calls.
> 
> So given that, how would you implement readline()? Reading one byte at
> a time until you've got the \n is definitely way too slow given the
> constant overhead of system calls.
> 
> Regarding optimal buffer size, I've never seen a program for which 8K
> wasn't optimal. Larger buffers simply don't pay off.

Well, as far as readline goes: In order to split the text into lines,
you have to decode the text first anyway, which is a layer 3 operation.
You can't just read bytes until you get a \n, because the file you are
reading might be encoded in UCS-2 or something. So for example, in a
big-endian UCS-2 encoding, newline would be encoded as 0x00 0x0A, whereas
in a little-endian UCS-2 encoding, it would be 0x0A 0x00. Merely stopping
at the 0x0A byte is incorrect: in the little-endian case you've only
read half the character.
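
To make the byte-level problem concrete, here is a minimal sketch in
Python. The helper name naive_readline is made up for illustration, and
the "utf-16-be" / "utf-16-le" codecs stand in for UCS-2, which is an
assumption that holds for characters in the Basic Multilingual Plane.

```python
def naive_readline(raw: bytes) -> bytes:
    """Hypothetical byte-level readline: stop at the first 0x0A byte."""
    idx = raw.find(b"\n")
    return raw if idx < 0 else raw[: idx + 1]


text = "hi\nthere"

# Big-endian: newline is 0x00 0x0A, so the cut happens to land on a
# code-unit boundary.
line_be = naive_readline(text.encode("utf-16-be"))

# Little-endian: newline is 0x0A 0x00, so the cut lands in the middle of
# the newline and leaves an odd number of bytes.
line_le = naive_readline(text.encode("utf-16-le"))

try:
    line_le.decode("utf-16-le")
    le_decodes = True
except UnicodeDecodeError:
    le_decodes = False

# A 0x0A byte can also belong to a different character entirely:
# U+010A is encoded as 0x01 0x0A in big-endian form.
false_hit = naive_readline("\u010ax\n".encode("utf-16-be"))
```

In this sketch the little-endian slice cannot be decoded on its own,
and the last call stops at a character that is not a newline at all.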

You're correct that reading by line does require a buffer if you want to 
do it efficiently. However, in a world of character encodings, the 
readline buffer has to be implemented at a higher level in the IO stack, 
at the same level which understands text encodings. There may be a 
different set of buffers at the lower level to minimize the number of 
disk i/o operations, but they can't really be the same buffer -- either 
that, or the text encoding layer will need to have fairly incestuous 
knowledge of what's going on at the lower layers so that it can peek 
inside its buffers.
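
As a rough sketch of what a readline buffer at the decoding layer could
look like, the following Python keeps its pending data as decoded text
rather than bytes. The class name DecodingLineReader and its parameters
are hypothetical and not part of any proposal; the only assumption
about the lower layer is that it has a read(n) method returning bytes,
with an empty result at end of file.

```python
import codecs
import io


class DecodingLineReader:
    """Hypothetical text-layer reader that buffers decoded characters."""

    def __init__(self, raw, encoding, chunk_size=8192):
        self._raw = raw                      # lower layer: anything with read(n) -> bytes
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._chunk_size = chunk_size
        self._pending = ""                   # decoded text not yet handed out
        self._eof = False

    def readline(self):
        # Pull chunks from the lower layer until the decoded text
        # contains a newline or the input is exhausted.
        while "\n" not in self._pending and not self._eof:
            chunk = self._raw.read(self._chunk_size)
            if chunk:
                # The incremental decoder holds back incomplete
                # characters that straddle a chunk boundary.
                self._pending += self._decoder.decode(chunk)
            else:
                self._eof = True
                self._pending += self._decoder.decode(b"", final=True)
        idx = self._pending.find("\n")
        if idx < 0:
            line, self._pending = self._pending, ""
        else:
            line = self._pending[: idx + 1]
            self._pending = self._pending[idx + 1:]
        return line


# Usage: a tiny chunk size forces reads to split 16-bit code units.
data = "hi\nthere\n".encode("utf-16-le")
reader = DecodingLineReader(io.BytesIO(data), "utf-16-le", chunk_size=3)
lines = [reader.readline(), reader.readline(), reader.readline()]
```

The newline search here only ever sees decoded characters, so it does
not depend on how the lower layer chooses to chunk or buffer its bytes.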

It seems to me that no matter how you slice it, you can't have an 
abstract "buffering" layer that is independent of both the layer beneath 
and the layer above. Both the text decoding layer and the disk i/o layer 
need to have fairly intimate knowledge of their buffers if you want 
maximum efficiency. (I'm not opposed to a custom implementation of 
buffering in the layer 1 file object itself, although I suspect in most
cases you'd be better off using what the OS or its standard libs provide.)

As far as stdio not giving you hints as to how it is using the buffer, I 
am not sure what you mean...what kind of information would a custom 
buffer implementation give you that stdio would not? If it's early
detection of \n is what you are thinking of, I've already shown that 
won't work unless you are assuming an 8-bit encoding.

-- Talin