Re: [Python-Dev] Small tweak to tokenize.py?

At 09:49 AM 11/30/2006 -0800, Guido van Rossum wrote:
> I've got a small tweak to tokenize.py that I'd like to run by folks here.
>
> I'm working on a refactoring tool for Python 2.x-to-3.x conversion, and my approach is to build a full parse tree with annotations that show where the whitespace and comments go. I use the tokenize module to scan the input. This is nearly perfect (I can render code from the parse tree and it will be an exact match of the input) except for continuation lines -- while tokenize gives me pseudo-tokens for comments and "ignored" newlines, it doesn't give me the backslashes at all (while it does give me the newline following the backslash).
The following routine will render a token stream, and it automatically restores the missing \'s. I don't know if it'll work with your patch, but perhaps you could use it instead of changing tokenize. For the documentation and examples, see:

http://peak.telecommunity.com/DevCenter/scale.dsl#converting-tokens-back-to-...

def detokenize(tokens, indent=0):
    """Convert `tokens` iterable back to a string."""
    out = []; add = out.append
    lr,lc,last = 0,0,''
    baseindent = None
    for tok, val, (sr,sc), (er,ec), line in flatten_stmt(tokens):

        # Insert trailing line continuation and blanks for skipped lines
        lr = lr or sr   # first line of input is first line of output
        if sr>lr:
            if last:
                if len(last)>lc:
                    add(last[lc:])
                lr+=1
            if sr>lr:
                add(' '*indent + '\\\n'*(sr-lr))    # blank continuation lines
            lc = 0

        # Re-indent first token on line
        if lc==0:
            if tok==INDENT:
                continue    # we want to dedent first actual token
            else:
                curindent = len(line[:sc].expandtabs())
                if baseindent is None and tok not in WHITESPACE:
                    baseindent = curindent
                elif baseindent is not None and curindent>=baseindent:
                    add(' ' * (curindent-baseindent))
                if indent and tok not in (DEDENT, ENDMARKER, NL, NEWLINE):
                    add(' ' * indent)

        # Not at start of line, handle intraline whitespace by retaining it
        elif sc>lc:
            add(line[lc:sc])

        if val:
            add(val)

        lr,lc,last = er,ec,line

    return ''.join(out)
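[For readers reconstructing the behavior under discussion: the sketch below is not from the original messages; it is era-appropriate Python 2 showing what Guido means -- tokenize emits the NEWLINE after a backslash continuation, but no token in the stream covers the backslash itself. Note also that detokenize above relies on flatten_stmt and WHITESPACE, which appear to be helpers from the scale.dsl module linked above.]

import tokenize
from StringIO import StringIO

src = "x = 1 + \\\n    2\n"
tokens = tokenize.generate_tokens(StringIO(src).readline)
for tok_type, tok_str, start, end, line in tokens:
    print tokenize.tok_name[tok_type], repr(tok_str), start, end

# The '+' token ends at (1, 7) and the '2' token starts at (2, 4);
# nothing in the stream accounts for the backslash at column 8 of line 1.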

Are you opposed to changing tokenize? If so, why (apart from compatibility)? ISTM that it would be a good thing if it reported everything except horizontal whitespace.

On 11/30/06, Phillip J. Eby <pje@telecommunity.com> wrote:
> At 09:49 AM 11/30/2006 -0800, Guido van Rossum wrote:
>> I've got a small tweak to tokenize.py that I'd like to run by folks here.
>>
>> I'm working on a refactoring tool for Python 2.x-to-3.x conversion, and my approach is to build a full parse tree with annotations that show where the whitespace and comments go. I use the tokenize module to scan the input. This is nearly perfect (I can render code from the parse tree and it will be an exact match of the input) except for continuation lines -- while tokenize gives me pseudo-tokens for comments and "ignored" newlines, it doesn't give me the backslashes at all (while it does give me the newline following the backslash).
>
> The following routine will render a token stream, and it automatically restores the missing \'s. I don't know if it'll work with your patch, but perhaps you could use it instead of changing tokenize. For the documentation and examples, see:
>
> http://peak.telecommunity.com/DevCenter/scale.dsl#converting-tokens-back-to-...
>
> def detokenize(tokens, indent=0):
>     """Convert `tokens` iterable back to a string."""
>     out = []; add = out.append
>     lr,lc,last = 0,0,''
>     baseindent = None
>     for tok, val, (sr,sc), (er,ec), line in flatten_stmt(tokens):
>
>         # Insert trailing line continuation and blanks for skipped lines
>         lr = lr or sr   # first line of input is first line of output
>         if sr>lr:
>             if last:
>                 if len(last)>lc:
>                     add(last[lc:])
>                 lr+=1
>             if sr>lr:
>                 add(' '*indent + '\\\n'*(sr-lr))    # blank continuation lines
>             lc = 0
>
>         # Re-indent first token on line
>         if lc==0:
>             if tok==INDENT:
>                 continue    # we want to dedent first actual token
>             else:
>                 curindent = len(line[:sc].expandtabs())
>                 if baseindent is None and tok not in WHITESPACE:
>                     baseindent = curindent
>                 elif baseindent is not None and curindent>=baseindent:
>                     add(' ' * (curindent-baseindent))
>                 if indent and tok not in (DEDENT, ENDMARKER, NL, NEWLINE):
>                     add(' ' * indent)
>
>         # Not at start of line, handle intraline whitespace by retaining it
>         elif sc>lc:
>             add(line[lc:sc])
>
>         if val:
>             add(val)
>
>         lr,lc,last = er,ec,line
>
>     return ''.join(out)
--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
> Are you opposed to changing tokenize? If so, why (apart from compatibility)? ISTM that it would be a good thing if it reported everything except horizontal whitespace.
it would be a good thing if it could, optionally, be made to report horizontal whitespace as well.

</F>

On 11/30/06, Fredrik Lundh <fredrik@pythonware.com> wrote:
> Guido van Rossum wrote:
>> Are you opposed to changing tokenize? If so, why (apart from compatibility)? ISTM that it would be a good thing if it reported everything except horizontal whitespace.
> it would be a good thing if it could, optionally, be made to report horizontal whitespace as well.
It's remarkably easy to get this out of the existing API: keep track of the end position returned by the previous call, and if it's different from the start position returned by the next call, slice the line text from the column positions, assuming the line numbers are the same. If the line numbers differ, something has been eating \n tokens; this shouldn't happen any more with my patch.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
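[To make the trick concrete, here is a minimal sketch -- not from the thread, and the wrapper name is invented for illustration -- of the bookkeeping Guido describes: compare each token's start position with the previous token's end position and slice the source line to recover the skipped horizontal whitespace.]

import tokenize
from StringIO import StringIO

def tokens_with_whitespace(readline):
    """Yield (name, text) pairs, inserting pseudo-'WS' entries for gaps."""
    prev_row, prev_col = 1, 0
    for tok in tokenize.generate_tokens(readline):
        tok_type, tok_str, (srow, scol), (erow, ecol), line = tok
        # A same-line gap between the previous end and this start is
        # plain horizontal whitespace; slice it out of the source line.
        if srow == prev_row and scol > prev_col:
            yield ('WS', line[prev_col:scol])
        yield (tokenize.tok_name[tok_type], tok_str)
        prev_row, prev_col = erow, ecol

for kind, text in tokens_with_whitespace(StringIO("x  =  1\n").readline):
    print kind, repr(text)

[For input without backslash continuations, concatenating the text fields of this stream reproduces the source exactly; the continuation case is precisely what Guido's patch addresses.]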

Guido van Rossum wrote:
>> it would be a good thing if it could, optionally, be made to report horizontal whitespace as well.
> It's remarkably easy to get this out of the existing API

sure, but it would be even easier if I didn't have to write that code myself (last time I did that, I needed a couple of tries before the parser handled all cases correctly...).

but maybe this could simply be handled by a helper generator in the tokenize module, one that wraps the standard tokenizer generator and inserts whitespace tokens where necessary?

> keep track of the end position returned by the previous call, and if it's different from the start position returned by the next call, slice the line text from the column positions, assuming the line numbers are the same. If the line numbers differ, something has been eating \n tokens; this shouldn't happen any more with my patch.

you'll still have to deal with multiline strings, right?

</F>

On 12/2/06, Fredrik Lundh <fredrik@pythonware.com> wrote:
> Guido van Rossum wrote:
>>> it would be a good thing if it could, optionally, be made to report horizontal whitespace as well.
>> It's remarkably easy to get this out of the existing API
> sure, but it would be even easier if I didn't have to write that code myself (last time I did that, I needed a couple of tries before the parser handled all cases correctly...).
> but maybe this could simply be handled by a helper generator in the tokenize module, one that wraps the standard tokenizer generator and inserts whitespace tokens where necessary?
A helper sounds like a promising idea. Anyone interested in volunteering a patch?
>> keep track of the end position returned by the previous call, and if it's different from the start position returned by the next call, slice the line text from the column positions, assuming the line numbers are the same. If the line numbers differ, something has been eating \n tokens; this shouldn't happen any more with my patch.
> you'll still have to deal with multiline strings, right?
No, they are returned as a single token whose start and stop positions correctly reflect the line/col of the beginning and end of the token. My current code (based on the second patch I gave in this thread and the algorithm described above) doesn't have to special-case anything except the ENDMARKER token (to break out of its loop :-).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
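[A quick check of the multiline-string point -- again an illustrative sketch rather than code from the thread: a triple-quoted string comes back as a single STRING token whose start and end positions already span the extra lines, so the previous-end/next-start bookkeeping stays consistent.]

import tokenize
from StringIO import StringIO

src = 's = """one\ntwo"""\n'
tokens = tokenize.generate_tokens(StringIO(src).readline)
for tok_type, tok_str, start, end, line in tokens:
    print tokenize.tok_name[tok_type], repr(tok_str), start, end

# The STRING token runs from (1, 4) to (2, 6), and the following NEWLINE
# starts at (2, 6): previous end and next start still meet exactly.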