Re: [Python-Dev] Small tweak to tokenize.py?

30 Nov 2006


Are you opposed to changing tokenize? If so, why (apart from
compatibility)? ISTM that it would be a good thing if it reported
everything except horizontal whitespace.

On 11/30/06, Phillip J. Eby  wrote:
...
At 09:49 AM 11/30/2006 -0800, Guido van Rossum wrote:
...
I've got a small tweak to tokenize.py that I'd like to run by folks here.
I'm working on a refactoring tool for Python 2.x-to-3.x conversion,
and my approach is to build a full parse tree with annotations that
show where the whitespace and comments go. I use the tokenize module
to scan the input. This is nearly perfect (I can render code from the
parse tree and it will be an exact match of the input) except for
continuation lines -- while tokenize gives me pseudo-tokens for
comments and "ignored" newlines, it doesn't give me the backslashes at
all (while it does give me the newline following the backslash).
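
To make the gap described above concrete, here is a minimal sketch using the Python 3 standard library (not the 2006-era module under discussion). The helper name `continuation_lines` is hypothetical, invented for this illustration. It relies only on the observation that no token carries the backslash, so the sole trace of a continuation is a jump in row numbers that no NEWLINE or NL token accounts for.

```python
import io
import tokenize


def continuation_lines(source):
    """Return 1-based numbers of lines that end in a backslash continuation.

    Hypothetical helper: tokenize emits no token for the backslash, so we
    look for a row jump that is not explained by a NEWLINE or NL token.
    """
    found = []
    prev = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if prev is not None and prev.type not in (tokenize.NEWLINE, tokenize.NL):
            # Rows prev.end[0] .. tok.start[0]-1 had no line-ending token.
            found.extend(range(prev.end[0], tok.start[0]))
        prev = tok
    return found


sample = "x = 1 + \\\n    2\ny = 3\n"
tokens = list(tokenize.generate_tokens(io.StringIO(sample).readline))

# None of the reported tokens contains the backslash itself.
backslash_reported = any("\\" in tok.string for tok in tokens)

print(backslash_reported)
print(continuation_lines(sample))
```

The row-jump test is only a heuristic for well-formed input; it is meant to show where the information goes missing, not to replace a change to tokenize.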

The following routine will render a token stream, and it automatically
restores the missing \'s.  I don't know if it'll work with your patch, but
perhaps you could use it instead of changing tokenize.  For the
documentation and examples, see:
http://peak.telecommunity.com/DevCenter/scale.dsl#converting-tokens-back-to-...

def detokenize(tokens, indent=0):
     """Convert `tokens` iterable back to a string."""
     out = []; add = out.append
     lr,lc,last = 0,0,''
     baseindent = None
     for tok, val, (sr,sc), (er,ec), line in flatten_stmt(tokens):
         # Insert trailing line continuation and blanks for skipped lines
         lr = lr or sr   # first line of input is first line of output
         if sr>lr:
             if last:
                 if len(last)>lc:
                     add(last[lc:])
                 lr+=1
             if sr>lr:
                 add(' '*indent + '\\\n'*(sr-lr))    # blank continuation lines
             lc = 0
         # Re-indent first token on line
         if lc==0:
             if tok==INDENT:
                 continue  # we want to dedent first actual token
             else:
                 curindent = len(line[:sc].expandtabs())
                 if baseindent is None and tok not in WHITESPACE:
                     baseindent = curindent
                 elif baseindent is not None and curindent>=baseindent:
                     add(' ' * (curindent-baseindent))
                 if indent and tok not in (DEDENT, ENDMARKER, NL, NEWLINE):
                     add(' ' * indent)
         # Not at start of line, handle intraline whitespace by retaining it
         elif sc>lc:
             add(line[lc:sc])
         if val:
             add(val)
         lr,lc,last = er,ec,line
     return ''.join(out)
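
The routine above depends on `flatten_stmt` and `WHITESPACE`, which are not shown in this message. As a self-contained illustration of the same general idea (recovering dropped text, including backslashes, from token positions), here is a rough sketch for Python 3's standard tokenize. It is a different and simpler technique than the one above: it does no re-indenting, and it reads the gaps from the original source text rather than from each token's `line` field. The names `text_between` and `render` are hypothetical.

```python
import io
import tokenize


def text_between(lines, start, end):
    """Return the source text lying between two (row, col) positions."""
    (r1, c1), (r2, c2) = start, end
    if (r2, c2) <= (r1, c1):
        return ""
    if r1 == r2:
        return lines[r1 - 1][c1:c2]
    parts = [lines[r1 - 1][c1:]]          # rest of first line (holds any backslash)
    parts.extend(lines[r1:r2 - 1])        # whole lines skipped in between
    if r2 - 1 < len(lines):
        parts.append(lines[r2 - 1][:c2])  # leading text on the last line
    return "".join(parts)


def render(source):
    """Rebuild `source` from its tokens plus the untokenized gaps."""
    lines = source.splitlines(keepends=True)
    out = []
    prev_end = (1, 0)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Whatever tokenize skipped (blanks, backslash + newline) sits in the gap.
        out.append(text_between(lines, prev_end, tok.start))
        out.append(tok.string)
        prev_end = max(prev_end, tok.end)
    return "".join(out)


sample = "total = 1 + \\\n        2\nif total:\n    pass  # done\n"
print(render(sample) == sample)
```

Because the gaps are copied verbatim, this only works when the original text is still at hand; a tool that rewrites the tree and then renders it would need the backslash recorded in the tree annotations instead.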
-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)