[Python-Dev] Small tweak to tokenize.py?

Guido van Rossum guido at python.org
Thu Nov 30 19:28:25 CET 2006


Are you opposed to changing tokenize? If so, why (apart from
compatibility)? ISTM that it would be a good thing if it reported
everything except horizontal whitespace.
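
For context, a minimal sketch (not part of the original mail; the example
source string is invented for illustration and uses today's Python 3
tokenize) of the behaviour described in the quoted message below: when a
logical line is continued with a backslash, none of the token strings that
tokenize emits contains the backslash itself.

    import io
    import tokenize

    source = "x = 1 + \\\n    2\n"
    for tok_type, tok_str, start, end, line in tokenize.generate_tokens(
            io.StringIO(source).readline):
        print(tokenize.tok_name[tok_type], repr(tok_str))
    # The backslash never appears in any tok_str; only the start/end
    # coordinates and each token's `line` attribute reveal that a
    # continuation line was present.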

On 11/30/06, Phillip J. Eby <pje at telecommunity.com> wrote:
> At 09:49 AM 11/30/2006 -0800, Guido van Rossum wrote:
> >I've got a small tweak to tokenize.py that I'd like to run by folks here.
> >
> >I'm working on a refactoring tool for Python 2.x-to-3.x conversion,
> >and my approach is to build a full parse tree with annotations that
> >show where the whitespace and comments go. I use the tokenize module
> >to scan the input. This is nearly perfect (I can render code from the
> >parse tree and it will be an exact match of the input) except for
> continuation lines -- while tokenize gives me pseudo-tokens for
> comments and "ignored" newlines, it doesn't give me the backslashes at
> all (though it does give me the newline following the backslash).
>
> The following routine will render a token stream, and it automatically
> restores the missing \'s.  I don't know if it'll work with your patch, but
> perhaps you could use it instead of changing tokenize.  For the
> documentation and examples, see:
>
> http://peak.telecommunity.com/DevCenter/scale.dsl#converting-tokens-back-to-text
>
>
> def detokenize(tokens, indent=0):
>      """Convert `tokens` iterable back to a string."""
>      out = []; add = out.append
>      lr,lc,last = 0,0,''
>      baseindent = None
>      for tok, val, (sr,sc), (er,ec), line in flatten_stmt(tokens):
>          # Insert trailing line continuation and blanks for skipped lines
>          lr = lr or sr   # first line of input is first line of output
>          if sr>lr:
>              if last:
>                  if len(last)>lc:
>                      add(last[lc:])
>                  lr+=1
>              if sr>lr:
>                  add(' '*indent + '\\\n'*(sr-lr))    # blank continuation lines
>              lc = 0
>
>          # Re-indent first token on line
>          if lc==0:
>              if tok==INDENT:
>                  continue  # we want to dedent first actual token
>              else:
>                  curindent = len(line[:sc].expandtabs())
>                  if baseindent is None and tok not in WHITESPACE:
>                      baseindent = curindent
>                  elif baseindent is not None and curindent>=baseindent:
>                      add(' ' * (curindent-baseindent))
>                  if indent and tok not in (DEDENT, ENDMARKER, NL, NEWLINE):
>                      add(' ' * indent)
>
>          # Not at start of line, handle intraline whitespace by retaining it
>          elif sc>lc:
>              add(line[lc:sc])
>
>          if val:
>              add(val)
>
>          lr,lc,last = er,ec,line
>
>      return ''.join(out)
>
>
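
The quoted routine depends on names that are presumably defined in the
scale.dsl module documented at the URL above (flatten_stmt and WHITESPACE)
together with token constants bearing the same names as tokenize's
(INDENT, DEDENT, ENDMARKER, NL, NEWLINE). As a self-contained illustration
of the same underlying idea, namely recovering the dropped backslashes
from each token's start/end coordinates and its `line` attribute, a rough
sketch (not from the original mail, and far less complete than the routine
above) might look like this:

    import io
    import tokenize

    def rough_detokenize(source):
        """Rebuild source text from tokenize's 5-tuples.

        Illustrative only: it leans on each token's coordinates and its
        `line` attribute, which is where trailing backslash continuations
        can be recovered, but it does not re-indent or emit blank
        continuation lines the way the quoted routine does.
        """
        out = []
        last_row, last_col, last_line = 1, 0, ""
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        for tok_type, tok_str, (srow, scol), (erow, ecol), line in tokens:
            if srow > last_row:
                # The token starts on a later physical line: flush the
                # remainder of the previous line, which is where a
                # trailing backslash-plus-newline reappears.
                out.append(last_line[last_col:])
                last_col = 0
            if scol > last_col:
                # Preserve intraline whitespace exactly as it appeared.
                out.append(line[last_col:scol])
            out.append(tok_str)
            last_row, last_col, last_line = erow, ecol, line
        return "".join(out)

    src = "x = 1 + \\\n    2\n"
    assert rough_detokenize(src) == src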


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

