[Python-Dev] Small tweak to tokenize.py?

Thu Nov 30 18:49:25 CET 2006

I've got a small tweak to tokenize.py that I'd like to run by folks here.

I'm working on a refactoring tool for Python 2.x-to-3.x conversion,
and my approach is to build a full parse tree with annotations that
show where the whitespace and comments go. I use the tokenize module
to scan the input. This is nearly perfect (I can render code from the
parse tree and it will be an exact match of the input) except for
continuation lines -- while the tokenize gives me pseudo-tokens for
comments and "ignored" newlines, it doesn't give me the backslashes at
all (while it does give me the newline following the backslash).

It would be trivial to add another yield to tokenize.py when the
backslah is detected:

--- tokenize.py	(revision 52865)
+++ tokenize.py	(working copy)
@@ -370,6 +370,8 @@
                 elif initial in namechars:                 # ordinary name
                     yield (NAME, token, spos, epos, line)
                 elif initial == '\\':                      # continued stmt
+                    # This yield is new; needed for better idempotency:
+                    yield (NL, initial, spos, (spos[0], spos[1]+1), line)
                     continued = 1
                 else:
                     if initial in '([{': parenlev = parenlev + 1

(Though I think that it should probably yield a single NL pseudo-token
whose value is a backslash followed by a newline; or perhaps it should
yield the backslash as a comment token, or as a new token. Thoughts?)

This wouldn't be 100% backwards compatible, so I'm not dreaming of
adding this to 2.5.1, but what about 2.6?

(There's another issue with tokenize.py too -- when you use it to
parse Python-like source code containing non-Python operators, e.g.
'?', it does something bogus.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)