[issue20387] tokenize/untokenize roundtrip fails with tabs

Sun Feb 2 11:08:37 CET 2014

Terry J. Reedy added the comment:

I think the problem is with untokenize.

import io
from tokenize import tokenize

s = b"if False:\n\tx=3\n\ty=3\n"
t = tokenize(io.BytesIO(s).readline)
for i in t: print(i)

produces a token stream that seems correct.

TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='if', start=(1, 0), end=(1, 2), line='if False:\n')
TokenInfo(type=1 (NAME), string='False', start=(1, 3), end=(1, 8), line='if False:\n')
TokenInfo(type=52 (OP), string=':', start=(1, 8), end=(1, 9), line='if False:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='if False:\n')
TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 1), end=(2, 2), line='\tx=3\n')
TokenInfo(type=52 (OP), string='=', start=(2, 2), end=(2, 3), line='\tx=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(2, 3), end=(2, 4), line='\tx=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 4), end=(2, 5), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='y', start=(3, 1), end=(3, 2), line='\ty=3\n')
TokenInfo(type=52 (OP), string='=', start=(3, 2), end=(3, 3), line='\ty=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(3, 3), end=(3, 4), line='\ty=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 4), end=(3, 5), line='\ty=3\n')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

The problem with untokenize and indents is this: in the old untokenize function for 2-tuples, now called 'compat', INDENT strings were added to a list and popped by the corresponding DEDENT. While compat has the minor problems of returning a string instead of bytes (which is actually how I think it should be) and of adding extraneous spaces within and at the end of lines, it correctly handles tabs in your example and in this one:

import io
from tokenize import tokenize, untokenize

s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
t = tokenize(io.BytesIO(s).readline)
print(untokenize(i[:2] for i in t).encode())
>>> 
b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'
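
To make the list-and-pop idea concrete, here is a minimal sketch of that bookkeeping. It is not the stdlib's compat code: rebuild_with_indent_stack is a hypothetical name, and appending a space after every NAME and NUMBER token is an assumption made to mimic the extraneous spaces in the output above.

```python
import io
from tokenize import (tokenize, DEDENT, ENCODING, ENDMARKER, INDENT,
                      NAME, NEWLINE, NL, NUMBER)


def rebuild_with_indent_stack(pairs):
    """Rebuild source text from (type, string) pairs.

    Hypothetical helper, not the stdlib implementation. INDENT strings
    are pushed on a list and popped by the matching DEDENT; the top of
    the list is written at the start of each logical line.
    """
    indents = []          # INDENT strings currently in effect
    out = []
    at_line_start = True
    for tok_type, tok_str in pairs:
        if tok_type in (ENCODING, ENDMARKER):
            continue      # neither contributes any source text
        if tok_type == INDENT:
            indents.append(tok_str)
            continue
        if tok_type == DEDENT:
            indents.pop()
            continue
        if tok_type in (NEWLINE, NL):
            out.append(tok_str)
            at_line_start = True
            continue
        if at_line_start:
            if indents:
                # The INDENT string is the full leading whitespace of
                # its line, so tabs are reproduced verbatim.
                out.append(indents[-1])
            at_line_start = False
        if tok_type in (NAME, NUMBER):
            tok_str += " "   # assumption: source of the extra spaces
        out.append(tok_str)
    return "".join(out)


source = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
pairs = [tok[:2] for tok in tokenize(io.BytesIO(source).readline)]
rebuilt = rebuild_with_indent_stack(pairs)
print(rebuilt.encode())
```

Because the indentation comes straight from the INDENT token strings rather than from column arithmetic, the tabs survive, and for this input the sketch should print the same bytes as the output quoted above.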

When tokenize was changed to produce 5-tuples, untokenize was changed to use the start and end coordinates, but all special processing of indents was cut in favor of .add_space(). So this issue is a regression due to inadequate testing.
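
A roundtrip check along the following lines might have caught this. It is only a sketch: roundtrips is a hypothetical helper name, and what it returns for the tab examples depends on the Python version it runs on, so no expected result is shown for them.

```python
import io
from tokenize import tokenize, untokenize


def roundtrips(source: bytes) -> bool:
    """Return True if untokenize reproduces source from full 5-tuples.

    Hypothetical helper. Because the token list starts with the
    ENCODING token, untokenize returns bytes, which can be compared
    with the input directly.
    """
    tokens = list(tokenize(io.BytesIO(source).readline))
    return untokenize(tokens) == source


# Inputs taken from this message; results are version dependent.
for example in (b"if False:\n\tx=3\n\ty=3\n",
                b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"):
    print(example, roundtrips(example))
```

Unlike the 2-tuple call shown earlier, this passes the complete tokens, so it exercises the path that relies on start and end coordinates.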

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20387>
_______________________________________