[Python-Dev] RE: [Patches] [Patch #102915] xreadlines : readlines :: xrange : range

Wed, 3 Jan 2001 18:13:15 -0500

[Guido]
> Are you sure Perl still uses stdio at all?

Pretty sure, but there are so many layers of macros the code is
undecipherable, and I can't step thru macros in the debugger either (that's
assuming I wanted to devote N hours to building Perl from source too --
which I don't).  Perl also makes heavy use of macroizing std library names,
so e.g. when I see "fopen" (which I do!), that doesn't mean I'm getting the
fopen I'm thinking of.  But the MSVC config files define all sorts of macros
to get at the MS stdio _cnt and _ptr (and most other) FILE* fields, and the
version of fopen in the Win32 stuff appears to defer to the platform fopen
(after doing Perlish stuff, like if someone passed "/dev/null" as the file
name, Perl changes it to "NUL").

This is what it's like:  the first line of Perl's win32_fopen is this:

    dTHXo;

That's conditionally defined in perl.h, either as

#define dTHXo			dTHXoa(PERL_GET_THX)

or, if pTHXo is not defined, as

#  define dTHXo		dTHX

dTHX in turn is #defined in 4 different places across 3 different files in 2
different directories.  I'll skip those.  OTOH, dTHXoa is easy!  It's only
defined once:

#define dTHXoa(a)		pTHXo = a

Ah, *that* clears it up <wink>.  Etc.  20 years ago I may have thought this
was fun.  I thought debugging large systems of m4 macros was fun then, and
I'm not sure this is either better or worse than that -- well, it's worse,
because I understood m4's implementation.

> If so, does it open the file in binary or in text mode?

Sorry, but I really don't know and it's a pit to pursue.  If it's not native
text mode, they do a good job of faking it (e.g., Ctrl-Z acts like an EOF
when reading a text file from Perl on Windows -- not something even Larry
would be likely to do on his own <wink>).

> Based on the APIs in MS's libc, I presume that the crlf->lf
> translation is not done by stdio proper but by the Unix I/O
> emulation just underneath it (open() has an O_BINARY option
> flag, so read() probably does the translation).

Yes; and late in the last release cycle, import.c's open_exclusive had a
Windows bug related to this (fdopen() used "wb", but the earlier open()
didn't use O_BINARY, and fdopen *acted* like it had used "w").  Also, the MS
setmode() function works on file handles, not streams.

> That comes down to copying most bytes an extra time.

Understood.  But the CRLF are stored physically on disk, so unless the disk
controller is converting them, *someone's* software (whether MS's or Perl's)
is doing it.  By the time Perl is doing its fast line-input stuff, and doing
what sure looks like a straight copy out of an IO buffer, it's clear from
the code that CRLF has already been translated to LF.

> (To test this hypothesis, you could try to open the test file
> with mode "rb" and see if it makes a difference.)

In Python, that saved about 10% (but got the wrong answers <wink>).  In
Perl, about 15-20%.  But I don't think that tells us who's doing the
translation.  Assuming that the translation takes about the same total time
for each, it makes sense that the percentage would be higher for Perl (since
its total runtime is lower:  same-sized slice of a smaller pie).

> My biggest worry: thread-safety.  There must be a way to lock
> the file (you indicated that fgets() uses it).

Yes, via the unadvertised _lock_str and _unlock_str macros defined in MS
mtdll.h, which is not on the include path:

/*
 * This is an internal C runtime header file. It is used when building
 * the C runtimes only. It is not to be used as a public header file.
 */

The routines and macros it calls are also unadvertised.  After an hour of
thrashing I wasn't able to successfully link any code trying to call these
routines.  Doesn't mean it's impossible, does means they're internal to MS
libc and aren't meant to be called by anything else.  That's why it's called
"cheating" <wink>.  Perl appears to ignore the whole issue (but Perl's
thread story is muddy at best).

[... ungetc ...]

Not worried here either.

> ...
> You probably don't have many lines longer than 1000 characters.

None, in fact.

>> + It drops (line-at-a-time) drops to a little under 13 seconds if I
>> comment out the thread macros.

> If you mean the Py_BLOCK_THREADS around the resize, that can be safely
> dropped.

I meant *all* thread-related macros -- was just trying to get a feel for how
much that fiddling cost (it's an expense Perl doesn't seem to have -- yet).
Was measurable but not substantial.  WRT the resize, there's now a "fast
path" that avoids it.

> (If/when we introduce Vladimir's malloc, we'll have to decide whether
> it is threadsafe by itself or whether it requires the global
> interpreter lock.  I vote to make it threadsafe by itself.)

As feared, this thread is going to consume my life <0.5 wink>.

> ...
> Doesn't it make more sense to delay the resize until this point?  I
> don't know how much the character copying accounts for, but I could
> imagine a strategy based on memchr() and memcpy() that first searches
> for a \n, and if found, allocates to the right size before copying.
> Typically, the buffer contains many lines, so this could be optimized
> into requiring a single exactly-sized malloc() call in the common case
> (where the buffer doesn't wrap).  But possibly scanning the buffer for
> \n and then copying the bytes separately, even with memcmp() and
> memcpy(), slows things down too much for this to be faster.

Turns out that Perl does very much what I was doing; the Perl code is
actually more burdensome, because its routine is trying to deal not only
with \n-termination, but also arbitrary-string termination (Perl's Awk-like
input record separator), and "paragraph mode", and fixed-size reads, and
some other stuff I can't figure out from the macro names.  In all cases with
a terminator, though, it's doing the same business of both copying and
testing in a very tight inner loop.  It doesn't appear to make any serious
attempts to avoid resizing the buffer.  But, Perl has its own malloc
routines, and I'm guessing they're highly tuned for this stuff.

Since we're stuck with the MS malloc-- and Win9x's in particular seems
lame --adding this near the start of my stuff did yield a nice speedup:

	if (fp->_cnt > 0 &&
	    (pBuf = (char *)memchr(fp->_ptr, '\n', fp->_cnt)) != NULL) {
	    	/* it's all in the buffer so don't bother releasing the
	    	 * global lock
	    	 */
		total_buf_size = pBuf - fp->_ptr + 1;
		v = PyString_FromStringAndSize(fp->_ptr,
			                       (int)total_buf_size);
		if (v != NULL) {
			pBuf = BUF(v) + total_buf_size;
			fp->_cnt -= total_buf_size;
			fp->_ptr += total_buf_size;
		}
		goto done;
	}

So that builds the result string directly from the stdio buffer when it can.
Times dropped from (before this particular small hack)

count_chars_lines    14.880 14.854
readlines_sizehint    9.280  9.302
using_fileinput      48.610 48.589
while_readline       13.450 13.451

to

count_chars_lines    14.780 14.784
readlines_sizehint    9.550  9.514
using_fileinput      43.560 43.584
while_readline       10.600 10.578

Since I have no long lines in this test data, and the stdio buffer typically
contains thousands of chars, most calls should be satisfied by the fast
path.  Compared to the previous code, the fast path (1) avoids global lock
fiddling (but that didn't account for much in a distinct test); (2) crawls
over the buffer twice instead of once; and, (3) avoids one (shrinking!)
realloc.  So crawling over the buffer an extra time costs nothing compared
to the cost of a resize; and that's likely just more evidence that
malloc/realloc suck on this platform.

CAUTION:  no file locking is going on now (because I haven't found a way to
do it).  My previous claim that the MS getc macro did no locking was wrong,
as I discovered by stepping thru the generated machine code.  stdio.h
#defines getc without locking, but in _MT mode it later gets #undef'ed and
turned into a function call.

>> /* XXX unclear to me why MS's getc macro does "& 0xff" */
>>			*pBuf++ = ch = *ms_ptr++ & 0xff;

> I know why.  getchar() returns an int in the range [-1, 255].  If
> chars are signed the &0xff is needed else you would get a return in
> the range [-128, 127] and -1 would be ambiguous (EOF==-1).

Bingo -- MS chars are signed.

> ...
> But here since you're copying to another character, it's pointless.

Yup!  Gone.

> ....
> Note that get_line() with negative n could be implemented as
> get_line(0) with some post processing.

Andrew's glibc getline code appears to have wanted to do that, but looks to
me like it's unreachable (unless I'm hallucinating, the "n < 0" test after
return from glibc getline can't succeed, because the enclosing block is
guarded by an "n==0" test).

> This should be done completely separately, in PyFile_GetLine.

I assume you have an editor <wink>.

> The negative n case is only used by raw_input() -- it means strip
> the \n and raise EOFError for EOF, and I expect that this is rarely
> if ever used in a speed-conscious situation.

I've never seen raw_input used except when stdin and stdout were connected
to a tty.  When I tried raw_input from a DOS box under the debugger, it
never called get_line.  Something trickier is going on there; I suspect it's
actually calling fgets (eventually) instead in that case.

more-mysteries-than-i-really-need-ly y'rs  - tim