[Python-Dev] RE: [Patches] [Patch #102915] xreadlines : readlines :: xrange : range

Wed, 03 Jan 2001 07:37:11 -0500

> 1. That Perl still takes only 6 seconds in line-at-a-time mode.

Are you sure Perl still uses stdio at all?

If so, does it open the file in binary or in text mode?  Based on the
APIs in MS's libc, I presume that the crlf->lf translation is not done
by stdio proper but by the Unix I/O emulation just underneath it
(open() has an O_BINARY option flag, so read() probably does the
translation).  That comes down to copying most bytes an extra time.

(To test this hypothesis, you could try to open the test file with
mode "rb" and see if it makes a difference.)

> 2. I originally wrote a getline workalike, instead of building
> directly into a PyString buffer.  That made my test run *slower*,
> and I'm talking factor of 2, not a yawn.  To judge from my usually
> silent disk (I've got 256Mb RAM on this box), I'm afraid the extra
> mallocs required may have triggered the horrid Win9x
> malloc-thrashing problem I wrote about while I was still at Dragon.
> Consider that another vote for Vlad's PyMalloc -- we've got no
> handle on x-platform dynamic memory behavior now.  Python's destiny
> is to replace both the platform OS and libc anyway <0.9 wink>.
>
> The scary parts:
>
> + As the "XXX" comments indicate, this is full of little
> insecurities.

My biggest worry: thread-safety.  There must be a way to lock the file
(you indicated that fgets() uses it).

> + Another one I just thought of: if the user's last operations on
> the fp were two or more consecutive ungetc calls, all bets are off.
> But then MS doesn't define what happens then either.

Python doesn't have an interface to ungetc(), and I believe the stdio
standard says you can only call ungetc() once consecutively.  Assuming
other C code linked with Python obeys this rule (a pretty safe
assumption), we should be fine.  And if the assumption is violated, I
presume it's really that C code's fault -- plus, it code that only
uses getc() would be screwed just as badly.

> + This is much less ambitious than I recall Perl's code being: it
> doesn't try to guess anything about the file, and effectively
> captures only what would happen if you could unroll the guts of a
> getc-in-a-loop and optimize the snot out of it.  The good news is
> that this means it's much easier to maintain (it touches only two of
> the MS FILE* fields, and in ways that are pretty obviously correct).
> The bad news is that this seems also pretty clearly all there *is*
> to be gotten out of breaking into the FILE* abstraction for the
> particular test case I'm using; and increasing TUNEME doesn't save
> any time at all: the sucker is flying at full speed already.

You probably don't have many lines longer than 1000 characters.

> + It drops (line-at-a-time) drops to a little under 13 seconds if I
> comment out the thread macros.

If you mean the Py_BLOCK_THREADS around the resize, that can be safely
dropped.  (If/when we introduce Vladimir's malloc, we'll have to
decide whether it is threadsafe by itself or whether it requires the
global interpreter lock.  I vote to make it threadsafe by itself.)

> + I haven't looked at Perl's implementation in a year, and they must
> have dreamt up another trick since then.  That's a "scary part"
> indeed to anyone who has ever looked at Perl's implementation.
>
> retreating-into-a-fetal-position-ly y'rs - tim
> 
> 
> Anyone wants to play, the sandbox is fileobject.c.  Do two things:
> insert this new chunk somewhere above get_line:
> 
> #ifdef MS_WIN32
> static PyObject*
> win32_getline(FILE *fp)
> {
> 	/* XXX ignores thread safety -- but so does MS's getc macro! */
> 	PyObject* v;
> 	char* pBuf;	/* next free slot in v's buffer */
> 	/* MS's internals are declared in terms of ints, but it's a sure bet
> 	 * that won't last forever -- use size_t now & live w/ the casting;
> 	 * ditto for Python's routines
> 	 */
> 	size_t total_buf_size = 100;
> 	size_t free_buf_size = total_buf_size;
> #define TUNEME 1000	/* how much to boost the string buffer when exhausted */
> 
> 	v = PyString_FromStringAndSize((char *)NULL, (int)total_buf_size);
> 	if (v == NULL)
> 		return NULL;
> 	pBuf = BUF(v);
> 	Py_BEGIN_ALLOW_THREADS
> 	for (;;) {
> 		char ch;
> 		size_t ms_cnt;	/* FILE->_cnt shadow */
> 		char* ms_ptr;	/* FILE->_ptr shadow */
> 		size_t max_to_copy, i;
> 		/* stdio buffer empty or in unknown state; rather
> 		 * than try to simulate every quirk of MS's internals,
> 		 * let the MS macros deal with it.
> 		 */
> 		/* XXX we also wind up here when we simply run out of string
> 		 * XXX buffer space, but I'm not sure I care:  making this a
> 		 * XXX double-nested loop doesn't seem worth it
> 		 */
> 		ch = getc(fp);
> 		if (ch == EOF)
> 			break;
> 		/* make sure we've got some breathing room */
> 		if (free_buf_size < 100) {
> 			size_t currentoffset = pBuf - BUF(v);
> 			total_buf_size += TUNEME;  /* XXX check for overflow */
> 			Py_BLOCK_THREADS
> 			if (_PyString_Resize(&v, (int)total_buf_size) < 0)
> 				return NULL;
> 			Py_UNBLOCK_THREADS
> 			pBuf = BUF(v) + currentoffset;
> 			free_buf_size = TUNEME;
> 		}
> 		/* ch wasn't EOF, so store it */
> 		*pBuf++ = ch;
> 		--free_buf_size;
> 		if (ch == '\n') {
> 			break;
> 		}
> 		ms_cnt = (size_t)fp->_cnt;
> 		if (!ms_cnt) {
> 			/* XXX this is a slow way to read one character at
> 			 * XXX a time if, e.g., the stream is unbuffered
> 			 */
> 			continue;
> 		}
> 		/* payback!  now we don't have to check for buffer overflows or
> 		 * EOF inside the loop, nor does the macro _filbuf() branch force
> 		 *  _ptr and _cnt in and out of memory on each iteration
> 		 */
> 		ms_ptr = fp->_ptr;
> 		assert(ms_cnt > 0);
> 		i = max_to_copy = ms_cnt < free_buf_size ? ms_cnt : free_buf_size;

Doesn't it make more sense to delay the resize until this point?  I
don't know how much the character copying accounts for, but I could
imagine a strategy based on memchr() and memcpy() that first searches
for a \n, and if found, allocates to the right size before copying.
Typically, the buffer contains many lines, so this could be optimized
into requiring a single exactly-sized malloc() call in the common case
(where the buffer doesn't wrap).  But possibly scanning the buffer for
\n and then copying the bytes separately, even with memcmp() and
memcpy(), slows things down too much for this to be faster.

> 		do {
> 			/* XXX unclear to me why MS's getc macro does "& 0xff" */
> 			*pBuf++ = ch = *ms_ptr++ & 0xff;

I know why.  getchar() returns an int in the range [-1, 255].  If
chars are signed the &0xff is needed else you would get a return in
the range [-128, 127] and -1 would be ambiguous (EOF==-1).  Not sure
if they *are* unsigned on any MS platform -- if they aren't, whoever
coded this wasn't thinking -- on the other hand the compiler probagbly
optimizes it out.  But here since you're copying to another character,
it's pointless.

> 		} while (--i && ch != '\n');
> 		/* update the shadows & counters */
> 		fp->_ptr = ms_ptr;
> 		free_buf_size -= max_to_copy - i;
> 		fp->_cnt = ms_cnt - (max_to_copy - i);
> 		if (ch == '\n')
> 			break;
> 	}
> 	Py_END_ALLOW_THREADS
> 	_PyString_Resize(&v, pBuf - BUF(v));
> 	return v;
> }
> #endif
> 
> 2. Within get_line, add this before the #endif (this is the getline #if block):
> 
> #elif defined(MS_WIN32)
> 	if (n == 0) {
> 		return win32_getline(fp);
> 	}

Note that get_line() with negative n could be implemented as
get_line(0) with some post processing.  This should be done completely
separately, in PyFile_GetLine.  The negative n case is only used by
raw_input() -- it means strip the \n and raise EOFError for EOF, and I
expect that this is rarely if ever used in a speed-conscious
situation.

--Guido van Rossum (home page: http://www.python.org/~guido/)