[Python-Dev] End of the line

Tim Peters tim_one@email.msn.com
Tue, 13 Jul 1999 23:56:15 -0400


[Tim]
> ... Icon ... sprouted an interesting change in semantics:  if you open
> a file for reading in ...text mode ... it normalizes Unix, Mac and
> Windows line endings to plain \n.  Writing in text mode still produces
> what's natural for the platform.

[Guido]
> I've been thinking about this myself -- exactly what I would do.

Me too <wink>.

> Not clear how easy it is to implement (given that I'm not so enthused
> about the idea of rewriting the entire I/O system without using stdio
> -- see archives).

The Icon implementation is very simple:  they *still* open the file in stdio
text mode.  "What's natural for the platform" on writing then comes for
free.  On reading, libc usually takes care of what's needed, and what
remains is to check for stray '\r' characters that stdio glossed over.  That
is, in fileobject.c, replacing

		if ((*buf++ = c) == '\n') {
			if (n < 0)
				buf--;
			break;
		}

with a block like (untested!)

		*buf++ = c;
		if (c == '\n' || c == '\r') {
			if (c == '\r') {
				/* store '\n' in place of the '\r' */
				*(buf-1) = '\n';
				/* consume following newline, if any */
				c = getc(fp);
				if (c != '\n')
					ungetc(c, fp);	/* no-op if c is EOF */
			}
			if (n < 0)
				buf--;
			break;
		}

Related trickery needed in readlines.  Of course the '\r' business should be
done only if the file was opened in text mode.
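
For anyone who would rather poke at the logic than at fileobject.c, here is a minimal Python sketch of the same normalization. It is an illustration only, not the proposed patch: the names read_normalized_line and read_all_lines are made up, the stream is assumed to be a seekable binary file object, and a one-byte relative seek stands in for ungetc().

```python
import io

def read_normalized_line(fp):
    # Read one line from a binary stream, mapping CR and CRLF to LF.
    out = bytearray()
    while True:
        c = fp.read(1)
        if not c:                          # EOF: return what was collected
            break
        if c == b"\r":
            # Look one byte ahead and swallow the LF of a CRLF pair.
            nxt = fp.read(1)
            if nxt and nxt != b"\n":
                fp.seek(-1, io.SEEK_CUR)   # stand-in for ungetc()
            c = b"\n"
        out += c
        if c == b"\n":
            break
    return bytes(out)

def read_all_lines(fp):
    # Collect normalized lines until EOF (an empty result).
    lines = []
    while True:
        line = read_normalized_line(fp)
        if not line:
            break
        lines.append(line)
    return lines
```

Feeding it io.BytesIO(b"unix\nmac\rdos\r\nlast") should yield four lines, the first three each ending in a single LF. Reading a byte at a time like this is slow and is meant only to make the control flow visible.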

> The implementation must be as fast as the current one -- people used
> to complain bitterly when readlines() or read() were just a tad
> slower than they *could* be.

The above does add one compare per character.  Haven't timed it.  readlines
may be worse.

BTW, people complain bitterly anyway, but it's in comparison to Perl text
mode line-at-a-time reads!

D:\Python>wc a.c
1146880 3023873 25281537 a.c

D:\Python>

Reading that via

def g():
    f = open("a.c")
    while 1:
        line = f.readline()
        if not line:
            break

and using python -O took 51 seconds.  Running the similar Perl (although
it's not idiomatic Perl to assign each line to an explicit var, or to test
that var in the loop, or to use "if !" instead of "unless" -- did all those
to make it more like the Python):

open(DATA, "<a.c");
while ($line = <DATA>) {last if ! $line;}

took 17 seconds.  So when people are complaining about a factor of 3, I'm
not inclined to get excited about a few percent <wink>.
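
To repeat this kind of measurement, a small harness along these lines would do. It is a sketch under assumptions, not how the numbers above were taken: count_lines and time_call are hypothetical names, and time.perf_counter is a modern-Python convenience.

```python
import io
import time

def count_lines(f):
    # Same shape as g() above: one readline() call per line.
    n = 0
    while True:
        line = f.readline()
        if not line:
            break
        n += 1
    return n

def time_call(func, *args):
    # Return (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    # Tiny in-memory demo; substitute open("a.c") for a real run.
    n, elapsed = time_call(count_lines, io.StringIO("a\nb\nc\n"))
    print(n, "lines in", elapsed, "seconds")
```

A single run on a warm file cache is noisy, so it is worth taking the best of a few runs before comparing anything.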

> There's a lookahead of 1 character needed -- ungetc() might be
> sufficient except that I think it's not guaranteed to work on
> unbuffered files.

Don't believe I've bumped into that.  *Have* bumped into problems with
ungetc not playing nice with fseek/ftell, and that's probably enough to kill
it right there (alas).
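
The bookkeeping behind that complaint can be modelled in a few lines of Python. This is a toy, not a description of any particular libc: PushbackReader is a hypothetical wrapper that holds at most one pushed-back byte, and the point is only that the position the caller sees and the position of the underlying stream disagree while a byte is pending.

```python
import io

class PushbackReader:
    # Hypothetical wrapper: a binary stream plus one byte of pushback.
    def __init__(self, raw):
        self.raw = raw
        self.pending = b""        # at most one pushed-back byte

    def getc(self):
        if self.pending:
            c, self.pending = self.pending, b""
            return c
        return self.raw.read(1)

    def ungetc(self, c):
        if self.pending:
            raise ValueError("only one byte of pushback is supported")
        self.pending = c

    def tell(self):
        # While a byte is pending, the raw stream is one byte ahead of
        # what the caller has actually consumed.
        return self.raw.tell() - len(self.pending)

    def seek(self, pos):
        # Seeking has to throw the pending byte away.
        self.pending = b""
        self.raw.seek(pos)
```

With the wrapper doing the arithmetic, tell() stays honest. C's stdio makes a weaker promise: the standard leaves the file position of a text stream unspecified after ungetc() until the pushed-back character is read or discarded.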

> Should also do this for the Python parser -- there it would be a lot
> easier.

And probably the biggest bang for the buck.

the-problem-with-exposing-libc-is-that-libc-isn't-worth-exposing<wink-ly
    y'rs  - tim