A much better tokenize.untokenize function

I am creating this post as a courtesy to anyone interested in python's tokenize module.

**tl;dr:** Various posts, linked below, discuss a much better replacement for untokenize. Do with it as you will. This code is very unlikely to be buggy. *Please* let me know if you find problems with it.

**About the new untokenize**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/VPqtB9lTEAAJ announces a replacement for the untokenize function in tokenize.py: https://github.com/python/cpython/blob/3.8/Lib/tokenize.py

To summarize that post: I have "discovered" a spectacular replacement for Untokenizer.untokenize in python's tokenize library module:

- The wretched, buggy, and impossible-to-fix add_whitespace method is gone.
- The new code has no significant 'if' statements, and knows almost nothing about tokens!

As I see it, the only possible failure modes might involve the zero-length line 0. See the above post for a full discussion.

**Testing**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/5X8IDzpgEAAJ discusses testing issues. Imo, the new code should easily pass all existing unit tests. The new code also passes a new unit test for Python issue 38663: https://bugs.python.org/issue38663, something the existing tests fail to do, even in "compatibility mode" (2-tuples).

Imo, the way is now clear for proper unit testing of python's Untokenizer class. In particular, it is, imo, time to remove compatibility mode. This hack has masked serious issues with untokenize: https://bugs.python.org/issue?%40columns=id%2Cactivity%2Ctitle%2Ccreator%2Cassignee%2Cstatus%2Ctype&%40sort=-activity&%40filter=status&%40action=searchid&ignore=file%3Acontent&%40search_text=untokenize&submit=search&status=-1%2C1%2C2%2C3

**Summary**

The new untokenize is the way it is written in The Book. I have done the heavy lifting on issue 38663. Python devs are free to do with it as they like. Your choice will not affect me or Leo in any way. The new code will soon become the foundation of Leo's token-oriented commands.

Edward

P.S. I would imagine that tokenize.untokenize is pretty much off most devs' radar :-) This Engineering Notebook post: https://groups.google.com/d/msg/leo-editor/aivhFnXW85Q/b2a8GHvEDwAJ discusses (in way too much detail :-) why untokenize is important to me.

To summarize that post: Imo, python devs are biased in favor of parse trees in programs involving text manipulations. I assert that the "real" black and fstringify tools would be significantly simpler, clearer and faster if they used python's tokenize module instead of python's ast module. Leo's own "beautify" and "fstringify" commands prove my assertion to my own satisfaction.

This opinion will be controversial, so I want to make the strongest possible case. I need to prove that handling tokens can be done simply and correctly in all cases. This is a big ask, because python's tokens are complicated. See the Lexical Analysis section of the Python Language Reference. The new untokenize furnishes the required proof, and does so elegantly.

EKR
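The new code itself is not reproduced in this thread, but the approach it describes can be sketched roughly as follows: rebuild the output by slicing the original source between consecutive token positions, so that no whitespace ever has to be guessed. This is only an illustration of the idea under stated assumptions; the function name untokenize_from_source, its signature, and every detail below are assumptions, not the actual Leo code.

```python
import io
import tokenize

def untokenize_from_source(tokens, contents):
    """Illustrative sketch only: rebuild source text from 5-tuples plus the
    full original source.  All inter-token text is copied verbatim from
    `contents`, so no add_whitespace-style guessing is required.
    Assumes `contents` contains no form feeds (str.splitlines would treat
    them as line breaks, unlike tokenize)."""
    lines = contents.splitlines(True)      # physical lines, keeping newlines
    offsets, total = [0], 0                # offsets[i] = index where line i+1 starts
    for line in lines:
        total += len(line)
        offsets.append(total)
    def index(row, col):                   # tokenize rows are 1-based
        return offsets[row - 1] + col
    result, prev = [], 0
    for tok in tokens:
        start, end = tok[2], tok[3]
        i, j = index(*start), index(*end)
        result.append(contents[prev:i])    # whitespace before the token, verbatim
        result.append(contents[i:j])       # the token itself, verbatim
        prev = j
    result.append(contents[prev:])         # anything after the last token
    return ''.join(result)

source = "print('hi')\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
assert untokenize_from_source(tokens, source) == source
```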

On 11/3/2019 11:12 AM, Edward K. Ream wrote:
I am creating this post as a courtesy to anyone interested in python's tokenize module.
As one of the 46 contributors to this module, and as one who fixed several untokenize bugs a few years ago, I am interested.
To continue, the first two lines of tokenize.untokenize() are:

    ut = Untokenizer()
    out = ut.untokenize(iterable)

Your leoBeautify.Untokenize class appears to be completely unsuited as a replacement for tokenize.Untokenizer, as the API for the class and method are incompatible with the above.

1. tokenize.Untokenizer takes no argument. leoBeautify.Untokenize() requires a 'contents' argument, a (unicode) string, that is otherwise undocumented. At first glance, it appears that 'contents' needs to be something like the desired output. (I could read the code where you call Untokenizer to improve my guess, but not now.) Since our existing tests do not pass 'contents', they should all fail.

2. tokenize.Untokenizer.untokenize(iterable) requires an iterable that returns "sequences with at least two elements, the token type and the token string." https://docs.python.org/3/library/tokenize.html#tokenize.untokenize One can generate python code from a sequence of pairs with a guarantee that the resulting code will be tokenized by the python.exe parser into the same sequence. The doc continues "Any additional sequence elements are ignored." The intent is that a tool can tokenize a file, modify the file (and thereby possibly invalidate the begin, end, and line elements of the original token stream) and generate a modified file. [Note that the end index (4th element), when present, is not ignored but is used to improve white space insertion. I believe that this should be documented. What if the end index is no longer valid? Should we also use the start index?]

leoBeautify.Untokenize.untokenize() requires an iterable of 5-tuples. It makes use of both the start and end elements, as well as the mysterious required 'contents' string. There is an even more spectacular replacement: rebuild the code from the 'line' elements. But while the above is an essential test, it is a toy example with respect to applications. The challenge is to create a correct and valid file from less information, possibly with only token type and string. (The latter is 'compatibility mode'.)
-- Terry Jan Reedy
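For reference, the documented contract Terry describes can be exercised with the standard library alone. The sketch below uses only the public tokenize API on a small, well-formed example; it is illustrative and is not part of either implementation under discussion, and the "rebuild from the 'line' elements" baseline at the end is only a toy version of Terry's aside (it ignores multi-line tokens).

```python
import io
import tokenize

source = "x = 1\nif x:\n    print(x)\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Compatibility mode: feed only (type, string) pairs.  The output is not
# guaranteed to match `source` character for character, only to tokenize
# back into the same sequence of (type, string) pairs.
pairs = [(t.type, t.string) for t in tokens]
rebuilt = tokenize.untokenize(pairs)
assert [(t.type, t.string)
        for t in tokenize.generate_tokens(io.StringIO(rebuilt).readline)] == pairs

# Full mode: feed the 5-tuples.  untokenize then uses the start/end positions
# to reproduce spacing, so for well-formed input like this the result matches
# the original exactly.
assert tokenize.untokenize(tokens) == source

# Rebuilding from the 'line' elements: each token carries its physical line,
# so the file can be recovered by collecting the distinct lines in order --
# a trivial baseline for sources with no multi-line tokens.
lines, seen = [], set()
for t in tokens:
    if t.line and t.start[0] not in seen:
        seen.add(t.start[0])
        lines.append(t.line)
assert ''.join(lines) == source
```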

On Sun, Nov 3, 2019 at 4:14 PM Terry Reedy <tjreedy@udel.edu> wrote:
**tl;dr:** Various posts, linked below, discuss a much better replacement for untokenize.
If that were true, I would be interested. But as explained below, I don't believe it.
I do not believe that the tone of my post was in any way objectionable. Yes, there was implied criticism of the code. I see no reason for you or anyone else to take that criticism personally. The postscript asserted only that token-based code might be a better choice (for some projects) than ast-based code. This is in no way a criticism of tokenize.py, or of any of its authors.

Clearly, the new code would have to be repackaged if it were to be made part of tokenize.py. That's all I would like to say now regarding your comments. Perhaps other devs will pick up the ball.

Edward

On Sun, Nov 3, 2019 at 7:23 PM Edward K. Ream <edreamleo@gmail.com> wrote:
Clearly, the new code would have to be repackaged if it were to be made part of tokenize.py. That's all I would like to say now regarding your comments. Perhaps other devs will pick up the ball.
After sleeping on your comments, I would like to distill them to their gist:

1. untokenize does not take a contents argument.
2. untokenize does not do exactly the same thing as the new code.

These are legitimate complaints, and I'm not quite sure what, if anything, can be done about them. However, I am not quite ready to dismiss the possibility of using the new code in the tokenize module, because...

*Motivations*

For convenience, let's call the proposed new code the *gem* :-) Let's be clear: there is no reason for python libraries to include every little code gem. However, I have two motivations for adding the gem to the tokenize module:

1. The gem would be useful for any token-based beautification code, such as the token-based versions of black or fstringify that Leo uses. Furthermore, the present untokenize method teaches away <https://www.google.com/search?client=firefox-b-1-d&q=teaches+away> from the gem.

2. Issue 12691 <https://bugs.python.org/issue12691> is troubling, because the unit tests failed to detect a code blunder in add_whitespace.

*Strengthening unit tests*

It's not clear how the gem could strengthen unit tests. Indeed, the gem is always going to pass ;-) This is frustrating, because full sources *are* available to TestRoundtrip.check_roundtrip. Indeed, that method starts with:

    if isinstance(f, str):
        code = f.encode('utf-8')
    else:
        code = f.read()
        f.close()

If f is a str, then f *is* "contents". Otherwise, "contents" can be recreated from the file's bytes.

*Summary*

Your objections are largely valid. Nevertheless, the gem may be worth further consideration.

There is no need to add every little code gem to python's standard library. In this case, there are two reasons why adding the gem to tokenize.py might be a good idea:

1. The gem would be of use to anyone writing a token-based tool. Such tools *do* have access to full sources.

2. The present untokenize method teaches away from the gem. The gem corrects the mistaken impression that untokenizing in the presence of full sources is difficult. In fact, it's a snap.

It is an open question whether it might be possible to strengthen the unit tests for the tokenize module using the gem. Full sources *are* available to TestRoundtrip.check_roundtrip.

Edward
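To make that last point concrete, here is one hedged sketch of what a stronger, source-based roundtrip check might look like. check_full_roundtrip is hypothetical, not an existing CPython test, and it leans on the illustrative untokenize_from_source helper sketched earlier in the thread; only generate_tokens and detect_encoding are real tokenize APIs.

```python
import io
import tokenize

def check_full_roundtrip(contents):
    """Hypothetical check: tokenize `contents`, rebuild it from the tokens
    plus the full source, and require an exact character-for-character match.
    This stronger check becomes possible whenever full sources are available,
    as they are (or can be recreated) in TestRoundtrip.check_roundtrip."""
    tokens = list(tokenize.generate_tokens(io.StringIO(contents).readline))
    rebuilt = untokenize_from_source(tokens, contents)  # illustrative helper above
    assert rebuilt == contents, "full-source roundtrip failed"

# "contents" can be recreated from a file's bytes, as noted above:
code = b"# -*- coding: utf-8 -*-\ndef f(a, b):\n    return a + b\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(code).readline)
check_full_roundtrip(code.decode(encoding))
```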

On 11/04/2019 03:42 AM, Edward K. Ream wrote:
On Sun, Nov 3, 2019 at 7:23 PM Edward K. Ream wrote:
Let's be clear, there is no reason for python libraries to include every little code gem. However, I have two motivations adding the gem to the tokenize module:
This is all irrelevant if you haven't signed the CLA*.

--
~Ethan~

* https://www.python.org/psf/contrib/contrib-form/

On Mon, Nov 4, 2019 at 11:48 AM Ethan Furman <ethan@stoneleaf.us> wrote:
This is all irrelevant if you haven't signed the CLA*.

IMO, this shouldn't be an argument against the proposed changes, as the CLA is fairly straightforward and only takes 24-48 hours to process. Unless the OP specifically is unable or unwilling to sign the CLA, it won't be a significant concern. Most new authors open their first PR without having the CLA already signed, then have the CLA signed and processed while waiting for review from a core developer.

On Mon, Nov 4, 2019 at 5:42 AM Edward K. Ream <edreamleo@gmail.com> wrote:
...there are two reasons why adding the gem to tokenize.py might be a good idea.
Heh. I myself am not convinced that the gem needs "advertising" in tokenize.py. Those interested in how Leo's token-based commands work would be more likely to find the gem in Leo's Untokenize class. Edward

participants (4)
- Edward K. Ream
- Ethan Furman
- Kyle Stanley
- Terry Reedy