A much better tokenize.untokenize function

I am creating this post as a courtesy to anyone interested in python's tokenize module.

**tl;dr:** Various posts, linked below, discuss a much better replacement for untokenize. Do with it as you will. This code is very unlikely to be buggy. *Please* let me know if you find problems with it.

**About the new untokenize**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/VPqtB9lTEAAJ announces a replacement for the untokenize function in tokenize.py: https://github.com/python/cpython/blob/3.8/Lib/tokenize.py

To summarize that post: I have "discovered" a spectacular replacement for Untokenizer.untokenize in python's tokenize library module:

- The wretched, buggy, and impossible-to-fix add_whitespace method is gone.
- The new code has no significant 'if' statements, and knows almost nothing about tokens!

As I see it, the only possible failure modes might involve the zero-length line 0. See the above post for a full discussion.

**Testing**

This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/5X8IDzpgEAAJ discusses testing issues. Imo, the new code should easily pass all existing unit tests. The new code also passes a new unit test for Python issue 38663: https://bugs.python.org/issue38663, something the existing tests fail to do, even in "compatibility mode" (2-tuples).

Imo, the way is now clear for proper unit testing of python's Untokenizer class. In particular, it is, imo, time to remove compatibility mode. This hack has masked serious issues with untokenize: https://bugs.python.org/issue?%40columns=id%2Cactivity%2Ctitle%2Ccreator%2Cassignee%2Cstatus%2Ctype&%40sort=-activity&%40filter=status&%40action=searchid&ignore=file%3Acontent&%40search_text=untokenize&submit=search&status=-1%2C1%2C2%2C3

**Summary**

The new untokenize is the way it is written in The Book. I have done the heavy lifting on issue 38663. Python devs are free to do with it as they like. Your choice will not affect me or Leo in any way. The new code will soon become the foundation of Leo's token-oriented commands.

Edward

P.S. I would imagine that tokenize.untokenize is pretty much off most devs' radar :-) This Engineering Notebook post: https://groups.google.com/d/msg/leo-editor/aivhFnXW85Q/b2a8GHvEDwAJ discusses (in way too much detail :-) why untokenize is important to me.

To summarize that post: Imo, python devs are biased in favor of parse trees in programs involving text manipulations. I assert that the "real" black and fstringify tools would be significantly simpler, clearer and faster if they used python's tokenize module instead of python's ast module. Leo's own "beautify" and "fstringify" commands prove my assertion to my own satisfaction.

This opinion will be controversial, so I want to make the strongest possible case. I need to prove that handling tokens can be done simply and correctly in all cases. This is a big ask, because python's tokens are complicated. See the Lexical Analysis section of the Python Language Reference. The new untokenize furnishes the required proof, and does so elegantly.

EKR
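The new code itself is not reproduced in this thread, but the approach it describes can be sketched roughly as follows: rebuild the output by slicing the original source between consecutive token positions, so that no whitespace ever has to be guessed. This is only an illustration of the idea under stated assumptions; the function name untokenize_from_source, its signature, and every detail below are assumptions, not the actual Leo code.

```python
import io
import tokenize

def untokenize_from_source(tokens, contents):
    """Illustrative sketch only: rebuild source text from 5-tuples plus the
    full original source.  All inter-token text is copied verbatim from
    `contents`, so no add_whitespace-style guessing is required.
    Assumes `contents` contains no form feeds (str.splitlines would treat
    them as line breaks, unlike tokenize)."""
    lines = contents.splitlines(True)      # physical lines, keeping newlines
    offsets, total = [0], 0                # offsets[i] = index where line i+1 starts
    for line in lines:
        total += len(line)
        offsets.append(total)
    def index(row, col):                   # tokenize rows are 1-based
        return offsets[row - 1] + col
    result, prev = [], 0
    for tok in tokens:
        start, end = tok[2], tok[3]
        i, j = index(*start), index(*end)
        result.append(contents[prev:i])    # whitespace before the token, verbatim
        result.append(contents[i:j])       # the token itself, verbatim
        prev = j
    result.append(contents[prev:])         # anything after the last token
    return ''.join(result)

source = "print('hi')\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
assert untokenize_from_source(tokens, source) == source
```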

On 11/3/2019 11:12 AM, Edward K. Ream wrote:
I am creating this post as a courtesy to anyone interested in python's tokenize module.
As one of the 46 contributors to this module, and as one who fixed several untokenize bugs a few years ago, I am interested.
To continue, the first two lines of tokenize.untokenize() are:

    ut = Untokenizer()
    out = ut.untokenize(iterable)

Your leoBeautify.Untokenize class appears to be completely unsuited as a replacement for tokenize.Untokenizer, as the API for the class and method are incompatible with the above.

1. tokenize.Untokenizer takes no argument. leoBeautify.Untokenize() requires a 'contents' argument, a (unicode) string, that is otherwise undocumented. At first glance, it appears that 'contents' needs to be something like the desired output. (I could read the code where you call Untokenizer to improve my guess, but not now.) Since our existing tests do not pass 'contents', they should all fail.

2. tokenize.Untokenizer.untokenize(iterable) requires an iterable that returns "sequences with at least two elements, the token type and the token string." https://docs.python.org/3/library/tokenize.html#tokenize.untokenize One can generate python code from a sequence of pairs with a guarantee that the resulting code will be tokenized by the python.exe parser into the same sequence. The doc continues "Any additional sequence elements are ignored." The intent is that a tool can tokenize a file, modify the file (and thereby possibly invalidate the begin, end, and line elements of the original token stream) and generate a modified file. [Note that the end index (4th element), when present, is not ignored but is used to improve white space insertion. I believe that this should be documented. What if the end index is no longer valid? Should we also use the start index?]

leoBeautify.Untokenize.untokenize() requires an iterable of 5-tuples. It makes use of both the start and end elements, as well as the mysterious required 'contents' string. There is an even more spectacular replacement: rebuild the code from the 'line' elements. But while the above is an essential test, it is a toy example with respect to applications. The challenge is to create a correct and valid file from less information, possibly with only token type and string. (The latter is 'compatibility mode'.)
-- Terry Jan Reedy
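For reference, the documented contract Terry describes can be exercised with the standard library alone. The sketch below uses only the public tokenize API on a small, well-formed example; it is illustrative and is not part of either implementation under discussion, and the "rebuild from the 'line' elements" baseline at the end is only a toy version of Terry's aside (it ignores multi-line tokens).

```python
import io
import tokenize

source = "x = 1\nif x:\n    print(x)\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Compatibility mode: feed only (type, string) pairs.  The output is not
# guaranteed to match `source` character for character, only to tokenize
# back into the same sequence of (type, string) pairs.
pairs = [(t.type, t.string) for t in tokens]
rebuilt = tokenize.untokenize(pairs)
assert [(t.type, t.string)
        for t in tokenize.generate_tokens(io.StringIO(rebuilt).readline)] == pairs

# Full mode: feed the 5-tuples.  untokenize then uses the start/end positions
# to reproduce spacing, so for well-formed input like this the result matches
# the original exactly.
assert tokenize.untokenize(tokens) == source

# Rebuilding from the 'line' elements: each token carries its physical line,
# so the file can be recovered by collecting the distinct lines in order --
# a trivial baseline for sources with no multi-line tokens.
lines, seen = [], set()
for t in tokens:
    if t.line and t.start[0] not in seen:
        seen.add(t.start[0])
        lines.append(t.line)
assert ''.join(lines) == source
```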

On Sun, Nov 3, 2019 at 4:14 PM Terry Reedy <tjreedy@udel.edu> wrote:
**tl;dr:** Various posts, linked below, discuss a much better replacement for untokenize.
If that were true, I would be interested. But as explained below, I don't believe it.
I do not believe that the tone of my post was in any way objectionable. Yes, there was implied criticism of the code. I see no reason for you or anyone else to take that criticism personally. The postscript asserted only that token-based code might be a better choice (for some projects) than ast-based code. This is in no way a criticism of tokenize.py, or of any of its authors.

Clearly, the new code would have to be repackaged if it were to be made part of tokenize.py. That's all I would like to say now regarding your comments. Perhaps other devs will pick up the ball.

Edward

On Sun, Nov 3, 2019 at 7:23 PM Edward K. Ream <edreamleo@gmail.com> wrote:
Clearly, the new code would have to be repackaged if it were to be made part of tokenize.py. That's all I would like to say now regarding your comments. Perhaps other devs will pick up the ball.
After sleeping on your comments, I would like to distill them to their gist:

1. untokenize does not take a contents argument.
2. untokenize does not do exactly the same thing as the new code.

These are legitimate complaints, and I'm not quite sure what, if anything, can be done about them. However, I am not quite ready to dismiss the possibility of using the new code in the tokenize module, because...

*Motivations*

For convenience, let's call the proposed new code the *gem* :-) Let's be clear: there is no reason for python libraries to include every little code gem. However, I have two motivations for adding the gem to the tokenize module:

1. The gem would be useful for any token-based beautification code, such as the token-based versions of black or fstringify that Leo uses. Furthermore, the present untokenize method teaches away <https://www.google.com/search?client=firefox-b-1-d&q=teaches+away> from the gem.

2. Issue 12691 <https://bugs.python.org/issue12691> is troubling, because the unit tests failed to detect a code blunder in add_whitespace.

*Strengthening unit tests*

It's not clear how the gem could strengthen unit tests. Indeed, the gem is always going to pass ;-) This is frustrating, because full sources *are* available to TestRoundtrip.check_roundtrip. Indeed, that method starts with:

    if isinstance(f, str):
        code = f.encode('utf-8')
    else:
        code = f.read()
        f.close()

If f is a str, then f *is* "contents". Otherwise, "contents" can be recreated from the file's bytes.

*Summary*

Your objections are largely valid. Nevertheless, the gem may be worth further consideration.

There is no need to add every little code gem to python's standard library. In this case, there are two reasons why adding the gem to tokenize.py might be a good idea:

1. The gem would be of use to anyone writing a token-based tool. Such tools *do* have access to full sources.

2. The present untokenize method teaches away from the gem. The gem corrects the mistaken impression that untokenizing in the presence of full sources is difficult. In fact, it's a snap.

It is an open question whether it might be possible to strengthen the unit tests for the tokenize module using the gem. Full sources *are* available to TestRoundtrip.check_roundtrip.

Edward
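To make that last point concrete, here is one hedged sketch of what a stronger, source-based roundtrip check might look like. check_full_roundtrip is hypothetical, not an existing CPython test, and it leans on the illustrative untokenize_from_source helper sketched earlier in the thread; only generate_tokens and detect_encoding are real tokenize APIs.

```python
import io
import tokenize

def check_full_roundtrip(contents):
    """Hypothetical check: tokenize `contents`, rebuild it from the tokens
    plus the full source, and require an exact character-for-character match.
    This stronger check becomes possible whenever full sources are available,
    as they are (or can be recreated) in TestRoundtrip.check_roundtrip."""
    tokens = list(tokenize.generate_tokens(io.StringIO(contents).readline))
    rebuilt = untokenize_from_source(tokens, contents)  # illustrative helper above
    assert rebuilt == contents, "full-source roundtrip failed"

# "contents" can be recreated from a file's bytes, as noted above:
code = b"# -*- coding: utf-8 -*-\ndef f(a, b):\n    return a + b\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(code).readline)
check_full_roundtrip(code.decode(encoding))
```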

On 11/04/2019 03:42 AM, Edward K. Ream wrote:
On Sun, Nov 3, 2019 at 7:23 PM Edward K. Ream wrote:
Let's be clear, there is no reason for python libraries to include every little code gem. However, I have two motivations adding the gem to the tokenize module:
This is all irrelevant if you haven't signed the CLA*.

--
~Ethan~

* https://www.python.org/psf/contrib/contrib-form/

On Mon, Nov 4, 2019 at 11:48 AM Ethan Furman <ethan@stoneleaf.us> wrote:
This is all irrelevant if you haven't signed the CLA*.

IMO, this shouldn't be an argument against the proposed changes, as the CLA is fairly straightforward and only takes 24-48 hours to process. Unless the OP specifically is unable or unwilling to sign the CLA, it won't be a significant concern. Most new authors open their first PR without having the CLA already signed, then have the CLA signed and processed while waiting for review from a core developer.

On Mon, Nov 4, 2019 at 5:42 AM Edward K. Ream <edreamleo@gmail.com> wrote:
...there are two reasons why adding the gem to tokenize.py might be a good idea.
Heh. I myself am not convinced that the gem needs "advertising" in tokenize.py. Those interested in how Leo's token-based commands work would be more likely to find the gem in Leo's Untokenize class. Edward

participants (4)
- Edward K. Ream
- Ethan Furman
- Kyle Stanley
- Terry Reedy