[Python-Dev] xreadlines : readlines :: xrange : range

Thomas Wouters thomas@xs4all.net
Thu, 4 Jan 2001 15:59:05 +0100

On Thu, Jan 04, 2001 at 09:16:39AM -0500, Guido van Rossum wrote:
> [Thomas finds that on FreeBSD, getc() is faster than getc_unlocked().]

> Thomas, I really don't understand it.  The getc() source code you
> showed calls getc_unlocked().  So how can it be faster?  The answer
> must be somewhere else...  Cache line conflicts, the rewriting of the
> loop that I did, a compiler bug, the inlining, who knows.  Can you
> compare the generated assembly code?  On other platforms,
> getc_unlocked() typically speeds the readline() test case up by a
> significant factor (as in your BSDI numbers, where it's almost 3x
> faster).

Nono, reread my message, and your code. getc() isn't faster than
getc_unlocked(). getc() is faster than flockfile(f) + getc_unlocked(f) (+
the rearranging of the function, use of PyTHREAD_ALLOW inside the outer loop,
etc.) Significantly so when there is only one thread running (which is still
the common case for most systems, and something FreeBSD's libc can easily
detect using its inside knowledge), and marginally so when there is at least
one other thread. The small advantage in the multi-threaded case can be
explained by the rest of the changes.

You see, I was comparing a patched tree versus a non-patched tree, not a
getc_unlocked() enabled one versus a disabled one, so I was measuring the
speed difference of the *patch*, not of the use of getc_unlocked() vs
getc(). Here is the speed difference of just the use of getc() vs
getc_unlocked() (same tree, hand-edited config.h) in a non-threaded
environment:

> ./python-getc-disabled ~/test.py ~/termcapx10
total 1794310 chars and 37660 lines
count_chars_lines     0.271  0.273
readlines_sizehint    0.149  0.148
using_fileinput       0.898  0.898
while_readline        0.214  0.211

> ./python-getc-enabled ~/test.py ~/termcapx10
total 1794310 chars and 37660 lines
count_chars_lines     0.271  0.273
readlines_sizehint    0.148  0.148
using_fileinput       0.898  0.898
while_readline        0.214  0.211

As you see, no significant difference. Here is the difference in a threaded
environment (a second thread that does just 'time.sleep(900)'):

> ./python-getc-disabled ~/test.py ~/termcapx10
total 1794310 chars and 37660 lines
count_chars_lines     0.429  0.422
readlines_sizehint    0.200  0.211
using_fileinput       1.604  1.594
while_readline        0.465  0.461

> ./python-getc-enabled ~/test.py ~/termcapx10
total 1794310 chars and 37660 lines
count_chars_lines     0.429  0.430
readlines_sizehint    0.201  0.203
using_fileinput       1.600  1.602
while_readline        0.463  0.461

... where I have to note that the getc-disabled version's 'using_fileinput'
time fluctuates a lot more, mostly upwards, in the threaded environment. (I
see it jump to 1.609, 1.617 cputime, every few runs.) Still not a terribly
significant difference, but a hint that we, too, can use inside knowledge ;)
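
A minimal sketch of how the threaded comparison above could be set up. The
contents of test.py are not shown in this message, so everything here is an
assumption: while_readline() is a guess at what the test of that name does,
and timed(), start_sleeper() and make_sample() are hypothetical helpers. It is
written for a current Python 3, whose file objects are not built on stdio's
getc(), so it only reproduces the shape of the measurement (CPU time of a
readline() loop, with and without a second thread that just sleeps), not the
numbers.

```python
import os
import tempfile
import threading
import time


def while_readline(path):
    """Count characters and lines by calling readline() until EOF."""
    chars = lines = 0
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                break
            chars += len(line)
            lines += 1
    return chars, lines


def timed(func, path):
    """Return (result, CPU seconds used) for one call of func(path)."""
    start = time.process_time()
    result = func(path)
    return result, time.process_time() - start


def start_sleeper(seconds=900):
    """Start a daemon thread that only sleeps, making the process threaded."""
    thread = threading.Thread(target=time.sleep, args=(seconds,), daemon=True)
    thread.start()
    return thread


def make_sample(lines=1000):
    """Write a small throwaway text file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for i in range(lines):
            f.write("line %d\n" % i)
    return path


if __name__ == "__main__":
    sample = make_sample()
    try:
        print("single thread:      ", timed(while_readline, sample))
        start_sleeper()  # second thread that does just time.sleep(900)
        print("with sleeper thread:", timed(while_readline, sample))
    finally:
        os.remove(sample)
```

Running both interpreter builds against the same input file, several times
each, is what makes run-to-run jitter like the using_fileinput fluctuation
visible.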

> Could it be that you're mistaken and that somehow getc_unlocked() is
> *not* chosen on FreeBSD?  Then I could believe it, the rewritten loop
> is so different that the optimizer might have done something different
> to it.  (Check config.h.  When all else fails, I put an #error in the
> #ifdef branch that I expect not to be taken.)

Yah, #error is great for debugging, I use it a lot ;) But I'm sure of this.
FreeBSD's getc() is just craftily optimized. Note that if we can get
get_line using getc_unlocked() to run as fast as get_line using getc() on
FreeBSD, it should also benefit other platforms, because the only speed to
be had is in our own code :) Not that I'm saying it can be improved, just
that it apparently got slower because of this patch. I can't be much help
doing any performance tuning, though; I've about used up my lunch hour and
I'm working late tonight ;P

Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!