[issue12691] tokenize.untokenize is broken

Gareth Rees report at bugs.python.org
Wed Feb 5 02:52:52 CET 2014


Gareth Rees added the comment:

Yury, let me see if I can move this issue forward. I clearly haven't
done a good job of explaining these problems, how they are related,
and why it makes sense to solve them together, so let me have a go
now.

1. tokenize.untokenize() raises AssertionError if you pass it a
   sequence of tokens output from tokenize.tokenize(). This was my
   original problem report, and it's still not fixed in Python 3.4:

      Python 3.4.0b3 (default, Jan 27 2014, 02:26:41) 
      [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import tokenize, io
      >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline))
      >>> tokenize.untokenize(t)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 317, in untokenize
          out = ut.untokenize(iterable)
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 246, in untokenize
          self.add_whitespace(start)
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 232, in add_whitespace
          assert row <= self.prev_row
      AssertionError

   This defeats any attempt to use the sequence:

      input code -> tokenize -> transform -> untokenize -> output code

   to transform Python code. But this ought to be the main use case
   for the untokenize function! That's how I came across the problem
   in the first place, when I was starting to write Minipy
   <https://github.com/gareth-rees/minipy>.
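
   To make the intended use concrete, here is the sort of pipeline the
   assertion currently defeats (a minimal sketch, with a no-op
   "transform" step standing in for whatever rewriting a tool like
   Minipy would actually do):

      import io, tokenize

      def transform(tokens):
          # Placeholder: a real tool would rewrite the token stream here.
          return list(tokens)

      source = '1+1'.encode('utf8')
      tokens = tokenize.tokenize(io.BytesIO(source).readline)
      # Passing full 5-tuples back in raises AssertionError in 3.4.0b3,
      # exactly as in the traceback above.
      result = tokenize.untokenize(transform(tokens))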

2. Fixing problem #1 is easy (just swap <= for >=), but it raises the
   question: why wasn't this mistake caught by test_tokenize? There's
   a test function roundtrip() whose docstring says:

      Test roundtrip for `untokenize`. `f` is an open file or a
      string. The source code in f is tokenized, converted back to
      source code via tokenize.untokenize(), and tokenized again from
      the latter. The test fails if the second tokenization doesn't
      match the first.

   If I don't fix roundtrip() so that it exercises this case, how can
   I be sure I have fixed the underlying problem? Clearly it's
   necessary to fix the test first and establish that it provokes the
   assertion.

   So why doesn't roundtrip() detect the error? Well, it turns out
   that tokenize.untokenize() has two modes of operation and
   roundtrip() only tests one of them.

   The documentation for tokenize.untokenize() is rather cryptic, and
   all it says is:

      Each element returned by the [input] iterable must be a token
      sequence with at least two elements, a token number and token
      value. If only two tokens are passed, the resulting output is
      poor.

   The two modes have to be reverse-engineered from the implementation;
   they work as follows.

   In the first mode (which I have called "compatibility" mode after
   the method Untokenizer.compat() that implements it) you pass it
   tokens in the form of 2-element tuples (type, text). These must
   have exactly 2 elements.

   In the second mode (which I have called "full" mode based on the
   description "full input" in the docstring) you pass it tokens in
   the form of tuples with 5 elements (type, text, start, end, line).
   These are compatible with the namedtuples returned from
   tokenize.tokenize().

   The "full" mode has the buggy assertion, but
   test_tokenize.roundtrip() only tests the "compatibility" mode.

   So I must (i) fix roundtrip() so that it tests both modes, and
   (ii) improve the documentation for tokenize.untokenize() so that
   programmers have some chance of figuring this out in future!
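
   To make the distinction concrete, here is roughly what the two
   calling conventions look like (a sketch; I'm only showing the shape
   of the input, not claiming anything about the output):

      import io, tokenize

      tokens = list(tokenize.tokenize(io.BytesIO(b'1+1').readline))

      # "Compatibility" mode: 2-tuples (type, text). This is the only
      # mode that roundtrip() currently exercises.
      tokenize.untokenize([tok[:2] for tok in tokens])

      # "Full" mode: the 5-tuples straight from tokenize(). Only this
      # mode reaches add_whitespace(), and hence the buggy assertion.
      tokenize.untokenize(tokens)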

3. As soon as I make roundtrip() test both modes it provokes the
   assertion failure. Good, so I can fix the assertion. Problem #1
   solved.
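
   The fix itself is tiny; roughly this (a sketch of the corrected
   check in Untokenizer.add_whitespace(), not the literal patch):

      def add_whitespace(self, start):
          row, col = start
          assert row >= self.prev_row   # was: assert row <= self.prev_row
          col_offset = col - self.prev_col
          if col_offset:
              self.tokens.append(" " * col_offset)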

   But now there are test failures in "full" mode:

      $ ./python.exe -m test test_tokenize
      [1/1] test_tokenize
      **********************************************************************
      File "/Users/gdr/hg.python.org/cpython/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
      Failed example:
          for testfile in testfiles:
              if not roundtrip(open(testfile, 'rb')):
                  print("Roundtrip failed for file %s" % testfile)
                  break
          else: True
      Expected:
          True
      Got:
          Roundtrip failed for file /Users/gdr/hg.python.org/cpython/Lib/test/test_platform.py
      **********************************************************************
      1 items had failures:
         1 of  73 in test.test_tokenize.__test__.doctests
      ***Test Failed*** 1 failures.
      test test_tokenize failed -- 1 of 78 doctests failed
      1 test failed:
          test_tokenize

   Examination of the failed tokenization shows that if the source
   contains a backslash-newline, then in "full" mode
   Untokenizer.add_whitespace() fails to leave any space between the
   token before the backslash-newline and the token afterwards.

   The trouble is that backslash-newlines don't result in a token from
   tokenize(). But an updated Untokenizer.add_whitespace() can deduce
   that there must have been a backslash-newline whenever it sees the
   first token on a new line (other than a DEDENT or ENDMARKER) and
   the previous token was not a NEWLINE or NL, as sketched below.
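
   Expressed as a condition on the token stream, the deduction is
   roughly this (a hypothetical helper for illustration, not the
   actual patch):

      import tokenize

      def missing_backslash_newline(tok, prev_tok):
          # True when `tok` starts on a later row than `prev_tok` ended
          # on, yet no NEWLINE/NL token intervened: the only way the
          # tokenizer reaches a new row without emitting one is a
          # backslash-newline in the source.
          return (tok.start[0] > prev_tok.end[0]
                  and tok.type not in (tokenize.DEDENT, tokenize.ENDMARKER)
                  and prev_tok.type not in (tokenize.NEWLINE, tokenize.NL))

      # When this holds, untokenize() needs to emit a backslash and a
      # newline (plus any column offset) before appending the token's
      # text.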

4. OK, with problems 1-3 fixed, shouldn't I now actually test the
   documented round-trip property fully? The property that roundtrip()
   currently tests is:

      tokenize(untokenize(tokenize(code))) == tokenize(code)

   which is all that is guaranteed in "compatibility" mode. But the
   docstring for tokenize.untokenize() says:

      Round-trip invariant for full input:
          Untokenized source will match input source exactly

   So in "full" mode it seems that the following is guaranteed:

      untokenize(tokenize(code)) == code

   But if I add a test for this stronger round-trip property, it
   immediately fails. The docstring is not telling the truth!
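
   The stronger property is straightforward to state in code (a sketch
   with a hypothetical helper name, not the existing test):

      import io, tokenize

      def full_roundtrip_ok(source_bytes):
          # Documented invariant: untokenizing a full tokenization
          # should reproduce the input source exactly.
          tokens = tokenize.tokenize(io.BytesIO(source_bytes).readline)
          return tokenize.untokenize(tokens) == source_bytes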

5. Why doesn't the "full" round-trip property hold? It seems that
   there are (at least) two separate problems:

   i. tokenize() doesn't output a token for a backslash-newline, so
      although untokenize() can deduce that there must have been one,
      it can't know what whitespace preceded it.

   ii. In compatibility mode, the first token is always discarded. If
      the tokens came from tokenize.tokenize(), then this is the
      ENCODING token, so the encoding is never applied.

      (Benjamin Peterson had previously discovered this problem but
      just suppressed the failing test case instead of reporting it:
      <http://hg.python.org/cpython/rev/c13abed5d764>)
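
   Problem (ii) is easy to demonstrate: feed untokenize() 2-tuples
   derived from a source that declares a non-UTF-8 encoding
   (illustrative sketch only; the source text is made up):

      import io, tokenize

      source = "# -*- coding: latin-1 -*-\ns = 'café'\n".encode('latin-1')
      tokens = tokenize.tokenize(io.BytesIO(source).readline)
      result = tokenize.untokenize(tok[:2] for tok in tokens)
      # The ENCODING token is the first token, and compatibility mode
      # discards the first token, so (as described in (ii) above) the
      # declared encoding is never applied to the result.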

Looking through this sequence of woes again, it's clear that I can
save problems 4 and 5 for another issue. But I am confident that
problems 1-3 must be fixed together. So although Meador Inge said
above that "the current patch fixes way too many issues", I hope that
if he reads the argument above he will see why I did what I did: this
is the minimum sensible change I know how to make.

I have prepared a revised patch that applies to the current head.
Please review it: I'll do my best to fix all the problems you find.

----------
Added file: http://bugs.python.org/file33919/Issue12691.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12691>
_______________________________________

