[Python-Dev] xreadline speed vs readlines_sizehint

Mark Favas m.favas@per.dem.csiro.au
Thu, 11 Jan 2001 11:40:18 +0800


[Tim responded]
>>
>> total 131426612 chars and 514216 lines

>You average over 255 chars/line?  Really?  What kind of file are you
>reading?  I don't really want to measure the speed of line-at-a-time
>input on binary files where "line" doesn't actually make sense
><0.6 wink>.

Real-life input, my boy! It's actually a syslog from my mailserver,
consisting mainly of sendmail log messages, and I have a current need to
process these things (MS Exchange, corrupted database, clobbered backup
tapes), so this thread came along at the right time...
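As a quick check on the figure Tim quotes, the average line length follows directly from the totals reported in this message. A minimal sketch in present-day Python; the variable names are illustrative and do not come from the benchmark script:

```python
# Totals copied from the "total ... chars and ... lines" line in this message.
total_chars = 131426612
total_lines = 514216

# Average line length, newline included (assuming the benchmark counts
# each line with its terminator, which this message does not state).
avg_chars_per_line = total_chars / total_lines
print(f"{avg_chars_per_line:.1f} chars/line")
```

The result lands between 255 and 256, consistent with "over 255 chars/line".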

>Guido pointed out that his readlines_sizehint test forced use of a 1Mb
>buffer (in the call, not only the default value).  For whatever
>reason, that was significantly slower than using an 8Kb sizehint on my
>box.

Removing the buffer size arg in the call to readlines_sizehint results
in this (using up-to-the-minute CVS):
total 131426612 chars and 514216 lines
count_chars_lines     4.922  4.916
readlines_sizehint    3.881  3.850
using_fileinput      10.371 10.366
while_readline       10.943 10.916
for_xreadlines        2.990  2.967

and with an 8Kb sizehint:
total 131426612 chars and 514216 lines
count_chars_lines     5.241  5.216
readlines_sizehint    2.917  2.900
using_fileinput      10.351 10.333
while_readline       10.990 10.983
for_xreadlines        2.877  2.867
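The benchmark script itself is not included in this message, so the following is only a sketch of what the timed loops probably look like, written for present-day Python. The function bodies are assumptions based on the test names. `for_lines` uses plain file iteration, which later replaced xreadlines, and the 8Kb default for `sizehint` mirrors the second run above.

```python
import fileinput


def readlines_sizehint(path, sizehint=8 * 1024):
    """Read in batches; readlines(hint) returns whole lines totalling about hint bytes."""
    chars = lines = 0
    with open(path) as f:
        while True:
            batch = f.readlines(sizehint)
            if not batch:  # empty list means end of file
                break
            for line in batch:
                chars += len(line)
                lines += 1
    return chars, lines


def while_readline(path):
    """One readline() call per line; an empty string signals end of file."""
    chars = lines = 0
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                break
            chars += len(line)
            lines += 1
    return chars, lines


def using_fileinput(path):
    """Iterate through the fileinput module."""
    chars = lines = 0
    with fileinput.input(files=[path]) as stream:
        for line in stream:
            chars += len(line)
            lines += 1
    return chars, lines


def for_lines(path):
    """Plain file iteration, the modern stand-in for xreadlines (an assumption)."""
    chars = lines = 0
    with open(path) as f:
        for line in f:
            chars += len(line)
            lines += 1
    return chars, lines
```

All four functions return the same `(chars, lines)` pair for a given file, so any timing difference between them comes from the read idiom alone.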


>Another oddity is that while_readline is slower than using_fileinput
>for you.  From that I take it Python config does *not* #define
>
>     HAVE_GETC_UNLOCKED
>
>on your platform.  If that's true 

Nope, HAVE_GETC_UNLOCKED is indeed #define'd.

>(or esp. if it's not!), would you do me a
>favor?  Recompile fileobject.c with
>
>     USE_MS_GETLINE_HACK
>
>#define'd, try the timing test again (while_readline is the most
>interesting test for this), and run the test_bufio.py std test to make
>sure you're actually getting the right answers.

Sure:
With USE_MS_GETLINE_HACK and HAVE_GETC_UNLOCKED both #define'd (although
defining the former makes the latter definition irrelevant), and with
test_bufio also passing:
total 131426612 chars and 514216 lines
count_chars_lines     5.056  5.050
readlines_sizehint    3.771  3.667
using_fileinput      11.128 11.116
while_readline        8.287  8.233
for_xreadlines        3.090  3.083

With USE_MS_GETLINE_HACK and HAVE_GETC_UNLOCKED both #undef'ed (just for
completeness):
total 131426612 chars and 514216 lines
count_chars_lines     4.916  4.900
readlines_sizehint    3.875  3.867
using_fileinput      14.404 14.383
while_readline       322.728 321.837
for_xreadlines        7.113  7.100

So, having HAVE_GETC_UNLOCKED #define'd does make a small improvement
<grin>
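To put a number on that improvement, the while_readline timings from the three builds can be compared directly. This is a rough sketch using the first timing column only, since this message does not say what the two columns measure. The configuration labels are descriptive names chosen here, and the default-build figure comes from the first run above, which had no explicit sizehint.

```python
# while_readline, first timing column, copied from the tables in this message.
TOTAL_CHARS = 131426612
while_readline_secs = {
    "default build (HAVE_GETC_UNLOCKED defined)": 10.943,
    "USE_MS_GETLINE_HACK and HAVE_GETC_UNLOCKED": 8.287,
    "both #undef'ed": 322.728,
}

baseline = while_readline_secs["default build (HAVE_GETC_UNLOCKED defined)"]

# Ratio of each build's time to the default build's time.
relative = {name: secs / baseline for name, secs in while_readline_secs.items()}

for name, secs in while_readline_secs.items():
    mchars_per_sec = TOTAL_CHARS / secs / 1e6
    print(f"{name:45s} {secs:8.3f}s {relative[name]:6.2f}x {mchars_per_sec:6.2f} Mchars/s")
```

With both macros undefined, while_readline takes between 29 and 30 times as long as the default build on this file.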

-- 
Mark Favas  -   m.favas@per.dem.csiro.au
CSIRO, Private Bag No 5, Wembley, Western Australia 6913, AUSTRALIA