standard input, for s in f, and buffering

Jorgen Grahn grahn+nntp at snipabacken.se
Tue Apr 1 07:34:05 EDT 2008


On Mon, 31 Mar 2008 22:27:39 -0700 (PDT), Paddy <paddy3118 at googlemail.com> wrote:
> On Mar 31, 11:47 pm, Jorgen Grahn <grahn+n... at snipabacken.se> wrote:
>> On 31 Mar 2008 06:54:29 GMT, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
>>
>> > On Sun, 30 Mar 2008 21:02:44 +0000, Jorgen Grahn wrote:
>>
>> >> I realize this has to do with the extra read-ahead buffering documented for
>> >> file.next() and that I can work around it by using file.readline()
>> >> instead.

>> > You can use ``for line in lines:`` and pass ``iter(sys.stdin.readline,'')``
>> > as iterable for `lines`.
>>
>> Thanks.  I wasn't aware that building an iterator was that easy. The
>> tiny example program then becomes

>> By the way, I timed the three solutions given so far using 5 million
>> lines of standard input.  It went like this:
>>
>>   for s in file     :  1
>>   iter(readline, ''):  1.30  (i.e. 30% worse than for s in file)
>>   while 1           :  1.45  (i.e. 45% worse than for s in file)
>>   Perl while(<>)    :  0.65
>>
>> I suspect most of the slowdown comes from the interpreter having to
>> execute more user code, not from lack of extra heavy input buffering.

> Hi Juergen,
> From the python manpage:
>     -u     Force  stdin,  stdout  and stderr to be totally unbuffered.
>            On systems where it matters, also put stdin, stdout and
>            stderr in binary mode.  Note that there is internal
>            buffering in xreadlines(), readlines() and file-object
>            iterators ("for line in sys.stdin") which is not influenced
>            by this option.  To work around this, you will want to use
>            "sys.stdin.readline()" inside a "while 1:" loop.

> Maybe try adding the python -u option?

Doesn't help when the code is in a module, unfortunately.
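
For reference, Marc's readline-based iterator avoids the file
iterator's read-ahead altogether, so it works from inside a module
too, -u or no -u.  A minimal sketch (the function name is mine):

  import sys

  def copy_lines(out=sys.stdout):
      # iter(callable, sentinel) keeps calling readline() until it
      # returns '' (end of file), so no read-ahead buffer is involved.
      for line in iter(sys.stdin.readline, ''):
          out.write(line)

  if __name__ == '__main__':
      copy_lines()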

> Buffering is supposed to help when processing large amounts of I/O,
> but gives the 'many lines in before any output' that you saw
> originally.

"Is supposed to help", yes.  I suspect (but cannot prove) that the
kind of buffering done here doesn't buy more than 10% or so even in
artificial tests, if you consider the fact that "for s in f" is in
itself a faster construct than my workarounds in user code.
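
A rough way to redo that comparison, feeding it a large file of
lines; just a sketch, not exactly the harness I timed with, and the
script name is invented:

  import sys, time

  def count(lines):
      n = 0
      for line in lines:
          n += 1
      return n

  # Run twice on the same data, e.g.
  #   zcat lines.gz | python bench.py iterator
  #   zcat lines.gz | python bench.py readline
  if __name__ == '__main__':
      if sys.argv[1:] == ['readline']:
          lines = iter(sys.stdin.readline, '')
      else:
          lines = sys.stdin    # the buffered file-object iterator
      t0 = time.time()
      n = count(lines)
      sys.stderr.write("%d lines, %.2f s\n" % (n, time.time() - t0))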

Note that even with buffering, there seems to be one system call per
line when used interactively, and lines are of course passed to user
code one by one.

Lastly, there is still the question of having to press Ctrl-D twice
to end the loop, which I mentioned in my original posting.  That
still feels very wrong.
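
The effect is easy to see with a trivial echo loop run from a
terminal: the file-object iterator sits on the input and needs that
second Ctrl-D, while a readline loop echoes each line as it is typed
and stops at the first one.  A sketch:

  import sys

  def echo_buffered():
      # read-ahead buffering: output appears late, and interactively
      # it takes a second Ctrl-D before the loop ends
      for line in sys.stdin:
          sys.stdout.write(line)

  def echo_readline():
      # no read-ahead: each line is echoed as soon as it is typed,
      # and a single Ctrl-D ends the loop
      for line in iter(sys.stdin.readline, ''):
          sys.stdout.write(line)

  if __name__ == '__main__':
      echo_readline()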

> If the program is to be mainly used to handle millions of
> lines from a pipe or file, then why not leave the buffering in?
> If you need both interactive and batch friendly I/O modes you might
> need to add the ability to switch between two modes for your program.

That is exactly the tradeoff I am dealing with right now, and I think
I have come to the conclusion that I want no buffering.
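
If I did want both modes, I suppose sys.stdin.isatty() could pick the
line source at startup; something like this sketch:

  import sys

  def input_lines():
      # interactive: avoid read-ahead so lines (and EOF) arrive promptly;
      # batch: keep the faster buffered file-object iterator
      if sys.stdin.isatty():
          return iter(sys.stdin.readline, '')
      return iter(sys.stdin)

  for line in input_lines():
      sys.stdout.write(line)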

My source data set can be huge (gigabytes of text) but in reality it
is boiled down to at most 50000 lines by a Perl script further to the
left in my pipeline:

  zcat foo.gz | perl | python > bar

The Perl script takes ~100 times longer to execute, and both are
designed as filters, which means a modest increase in CPU time for the
Python script isn't visible to the end user.

/Jorgen

-- 
  // Jorgen Grahn <grahn@        Ph'nglui mglw'nafh Cthulhu
\X/     snipabacken.se>          R'lyeh wgah'nagl fhtagn!


