[Patches] Parsing strings with \r\n or \r

Greg Ward gward@python.net
Sun, 28 May 2000 19:32:30 -0400


Hi all --

I've done a little digging into the parsing problem that I noticed a few
days ago, namely that CR and CR-LF line-endings on Unix are accepted when
parsing a file (import), but not when parsing a string
(py_compile.compile(), but also visible with the builtin compile()).
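(Pending a real fix in the tokenizer, a caller can of course paper over
this at the Python level by normalizing line endings before compiling.
A minimal sketch -- normalize_newlines() is just a helper I made up for
illustration, not anything in the library:)

```python
import re

def normalize_newlines(source):
    # Collapse "\r\n" and bare "\r" down to "\n" before handing the
    # string to compile().  A workaround, not a fix for the tokenizer.
    return re.sub(r'\r\n?', '\n', source)

# A string with MS-DOS and Mac line endings now compiles on Unix:
code = compile(normalize_newlines("x = 1\r\ny = 2\r"), "<string>", "exec")
```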

Of course, it's really a tokenizing problem, not a parsing problem.  But
that's just being pedantic.  I cooked up an idiot-simple patch to
tokenizer.c that slaps a band-aid on the problem, but I'm not sure if
it's the best solution.  First, here's the patch:

*** tokenizer.c.orig	Sun May 28 17:30:25 2000
--- tokenizer.c	Sun May 28 19:09:29 2000
***************
*** 644,649 ****
--- 644,669 ----
  		return NEWLINE;
  	}
  	
+ 	/* MS-DOS or Mac newline leaking through on Unix */
+ #ifdef unix
+ 	if (c == '\r') {
+ 		c = tok_nextc(tok);
+ 		tok->atbol = 1;
+ 		if (blankline || tok->level > 0)
+ 			goto nextline;
+ 		*p_start = tok->start;
+ 		if (c != '\n') {
+ 			tok_backup(tok, c);
+ 			*p_end = tok->cur - 1; /* leave \r out */
+ 		}
+ 		else {
+ 			*p_end = tok->cur - 2; /* leave \r\n out */
+ 		}
+ 
+ 		return NEWLINE;
+ 	}		
+ #endif
+ 
  #ifdef macintosh
  	if (c == '\r') {
  		PySys_WriteStderr(


The first problem with this is that the logic overlaps with the
immediately previous case, "if (c == '\n')".  That can be fixed by
merging the two and making the logic more complicated, which I didn't do 
because I wanted the patch to be comprehensible.

The bigger problem is that there is *already* code for dealing with \r
in the tokenizer.  Near the bottom of tokenizer.c, we see:

#ifndef macintosh
			/* replace "\r\n" with "\n" */
			/* For Mac we leave the \r, giving a syntax error */
			pt = tok->inp - 2;
			if (pt >= tok->buf && *pt == '\r') {
				*pt++ = '\n';
				*pt = '\0';
				tok->inp = pt;
			}
#endif

-- so that explains why \r\n in *files* is OK.  But apparently, this
code never runs when parsing strings -- I checked this by putting in
some printf's; it looks like we hit this code for every EOL plus EOF
when parsing a file, but never when parsing a string.

Besides this weakness, it seems kind of sneaky to turn \r\n into \n in
the routine that's just supposed to be fetching characters from a
buffer.  If my patch can be generalized, it strikes me as a more elegant
solution to the problem: if we see \n, or \r\n, or \r in the input
buffer, treat them all the same.  (At least on Unix, where the C library
does no translation of newlines.)
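In Python terms, the rule I have in mind amounts to the following -- a
toy line splitter to illustrate the intended equivalence, obviously not
the tokenizer itself:

```python
def split_lines(s):
    # Treat "\n", "\r\n", and a bare "\r" as the same line terminator.
    lines, start, i = [], 0, 0
    while i < len(s):
        c = s[i]
        if c == '\r' or c == '\n':
            lines.append(s[start:i])
            if c == '\r' and i + 1 < len(s) and s[i + 1] == '\n':
                i += 1          # swallow the "\n" half of a "\r\n" pair
            start = i + 1
        i += 1
    if start < len(s):
        lines.append(s[start:])  # last line had no terminator
    return lines
```

The same input splits identically whether it came from Unix, MS-DOS, or
Mac OS, which is the behaviour I'd like the tokenizer to have.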

A quick test: if I remove the "\r\n -> \n" hack, but leave my patch in,
it looks like the parser still handles Unix, MS-DOS, and Mac OS line
endings just fine.  Oops, not quite: if a bare "\r" ends a comment line,
the following code line is swallowed by the comment.  Well, I'd like to
hear if this sounds like the right solution before worrying about that
-- let me know what you think and I'll refine the patch.
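(For the record, the comment failure mode looks like this in miniature.
A toy scanner -- nothing to do with the real tokenizer's code -- that
ends comments only at "\n" will glue the next line onto a
"\r"-terminated comment:)

```python
def strip_comments(s):
    # Toy scanner: a "#" comment runs to the next "\n" only, so a bare
    # "\r" line ending lets the comment swallow the following code
    # line -- the same bug described above.
    out, i = [], 0
    while i < len(s):
        if s[i] == '#':
            while i < len(s) and s[i] != '\n':  # "\r" does NOT stop us
                i += 1
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)
```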

        Greg
-- 
Greg Ward - geek-on-the-loose                           gward@python.net
http://starship.python.net/~gward/
Condense soup, not books!