[Patches] Parsing strings with \r\n or \r
Greg Ward
gward@python.net
Sun, 28 May 2000 19:32:30 -0400
Hi all --
I've done a little digging into the parsing problem that I noticed a few
days ago, namely that CR and CR-LF line-endings on Unix are accepted when
parsing a file (import), but not when parsing a string
(py_compile.compile(), but also visible with the builtin compile()).
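For anyone who wants a workaround in the meantime: normalize the line
endings yourself before handing the string to compile().  A quick sketch
(the helper is mine, not part of anything in the stdlib):

```python
def normalize_newlines(source):
    # Map CR LF and bare CR to LF, so the string tokenizer only ever
    # sees \n -- the one terminator it currently accepts on Unix.
    return source.replace("\r\n", "\n").replace("\r", "\n")

# A string with DOS and Mac line endings now compiles fine:
code = normalize_newlines("x = 1\r\ny = 2\r")
compile(code, "<string>", "exec")
```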
Of course, it's really a tokenizing problem, not a parsing problem. But
that's just being pedantic. I cooked up an idiot-simple patch to
tokenizer.c that slaps a band-aid on the problem, but I'm not sure if
it's the best solution. First, here's the patch:
*** tokenizer.c.orig Sun May 28 17:30:25 2000
--- tokenizer.c Sun May 28 19:09:29 2000
***************
*** 644,649 ****
--- 644,669 ----
return NEWLINE;
}
+ /* MS-DOS or Mac newline leaking through on Unix */
+ #ifdef unix
+ if (c == '\r') {
+ c = tok_nextc(tok);
+ tok->atbol = 1;
+ if (blankline || tok->level > 0)
+ goto nextline;
+ *p_start = tok->start;
+ if (c != '\n') {
+ tok_backup(tok, c);
+ *p_end = tok->cur - 1; /* leave \r out */
+ }
+ else {
+ *p_end = tok->cur - 2; /* leave \r\n out */
+ }
+
+ return NEWLINE;
+ }
+ #endif
+
#ifdef macintosh
if (c == '\r') {
PySys_WriteStderr(
The first problem with this is that the logic overlaps with the
immediately previous case, "if (c == '\n')". That can be fixed by
merging the two and making the logic more complicated, which I didn't do
because I wanted the patch to be comprehensible.
The bigger problem is that there is *already* code for dealing with \r
in the tokenizer.  Near the bottom of tokenizer.c, we see:
#ifndef macintosh
	/* replace "\r\n" with "\n" */
	/* For Mac we leave the \r, giving a syntax error */
	pt = tok->inp - 2;
	if (pt >= tok->buf && *pt == '\r') {
		*pt++ = '\n';
		*pt = '\0';
		tok->inp = pt;
	}
#endif
-- so that explains why \r\n in *files* is OK. But apparently, this
code never runs when parsing strings -- I checked this by putting in
some printf's; it looks like we hit this code for every EOL plus EOF
when parsing a file, but never when parsing a string.
Besides this weakness, it seems kind of sneaky to turn \r\n into \n in
the routine that's just supposed to be fetching characters from a
buffer. If my patch can be generalized, it strikes me as a more elegant
solution to the problem: if we see \n, or \r\n, or \r in the input
buffer, treat them all the same. (At least on Unix, where the C library
does no translation of newlines.)
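The rule I'm proposing, sketched in Python rather than C (this function
is just an illustration of the line-splitting semantics I want, not what
the tokenizer actually does):

```python
def split_physical_lines(text):
    # Treat \n, \r\n, and bare \r identically as line terminators,
    # so DOS, Mac, and Unix input all split into the same lines.
    lines = []
    i = 0
    start = 0
    while i < len(text):
        c = text[i]
        if c == '\n':
            lines.append(text[start:i])
            i += 1
            start = i
        elif c == '\r':
            lines.append(text[start:i])
            i += 1
            if i < len(text) and text[i] == '\n':
                i += 1          # \r\n is one terminator, not two
            start = i
        else:
            i += 1
    if start < len(text):
        lines.append(text[start:])  # final line with no terminator
    return lines
```

With that rule, "a\nb\r\nc\rd" splits into four lines regardless of
which platform wrote it.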
A quick test: if I remove the "\r\n -> \n" hack, but leave my patch in,
it looks like the parser still handles Unix, MS-DOS, and Mac OS line
endings just fine. Oops, not quite: if a bare "\r" ends a comment line,
the following code line is swallowed by the comment. Well, I'd like to
hear if this sounds like the right solution before worrying about that
-- let me know what you think and I'll refine the patch.
Greg
--
Greg Ward - geek-on-the-loose gward@python.net
http://starship.python.net/~gward/
Condense soup, not books!