Boundaries between numbers and identifiers
In Python 2.5 `0or[]` was accepted by the Python parser. It became an error in 2.6 because "0o" came to be recognized as an incomplete octal number. `1or[]` is still accepted.
On the other hand, `1if 2else 3` is accepted despite the fact that "2e" can be recognized as an incomplete floating point number. In this case the tokenizer pushes "e" back and returns "2".
Shouldn't it do the same with "0o"? It is possible to make `0or[]` parseable again. The pure-Python implementation (the tokenize module) is able to tokenize this example:
    $ echo '0or[]' | ./python -m tokenize
    1,0-1,1: NUMBER '0'
    1,1-1,3: NAME 'or'
    1,3-1,4: OP '['
    1,4-1,5: OP ']'
    1,5-1,6: NEWLINE '\n'
    2,0-2,0: ENDMARKER ''
On the other hand, all these examples look weird. There is an asymmetry: `1or 2` is valid syntax, but `1 or2` is not. It is hard to visually recognize the boundary between a number and the following identifier or keyword, especially since numbers can contain letters ("b", "e", "j", "o", "x") and underscores, and identifiers can contain digits. Letters, digits, and underscores can appear on both sides of the boundary.
I propose to change the Python syntax by adding a requirement that there should be whitespace or a delimiter between a numeric literal and the following keyword.
On Apr 26, 2018, at 11:37 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
I propose to change the Python syntax by adding a requirement that there should be a whitespace or delimiter between a numeric literal and the following keyword.
-1
This would make Python 3.8 reject code due to stylistic preference. Code that it actually can unambiguously parse today.
I agree that a formatting style that omits whitespace between numerals and other tokens is terrible. However, if you start downright rejecting it, you will likely punish the wrong people. Users of third-party libraries will be met with random parsing errors in files they have no control over. This is not helpful.
And given BPO-33338 the standard library tokenizer would have to keep parsing those things as is.
Making 0or[] work again is also not worth it since that's been broken since Python 2.6 and hopefully nobody is running Python 2.5-only code anymore.
What we should do instead is make the standard library tokenizer reflect the behavior of Python 2.6+.
-- Ł
26.04.18 22:02, Lukasz Langa wrote:
On Apr 26, 2018, at 11:37 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
I propose to change the Python syntax by adding a requirement that there should be a whitespace or delimiter between a numeric literal and the following keyword.
-1
This would make Python 3.8 reject code due to stylistic preference. Code that it actually can unambiguously parse today.
Of course I don't propose to make it a syntax error in 3.8. It should first emit a SyntaxWarning and be converted into an error only in 3.10. Or maybe first add a rule for this to PEP 8 and make it a syntax error in the distant future, after all style checkers include it.
I agree that a formatting style that omits whitespace between numerals and other tokens is terrible. However, if you start downright rejecting it, you will likely punish the wrong people. Users of third-party libraries will be met with random parsing errors in files they have no control over. This is not helpful.
And given BPO-33338 the standard library tokenizer would have to keep parsing those things as is.
Making 0or[] work again is also not worth it since that's been broken since Python 2.6 and hopefully nobody is running Python 2.5-only code anymore.
What we should do instead is make the standard library tokenizer reflect the behavior of Python 2.6+.
The behavior of the standard library tokenizer doesn't contradict the rules. It is the most natural behavior for a regex-based tokenizer. Actually, it is the behavior of the built-in tokenizer that may be incorrect. In any case, accepting `1if 2else 3` while rejecting `0or[]` looks weird. They should follow the same rule: "0or" and "2else" should both be considered ambiguous, or both unambiguous.
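To illustrate why a regex-based tokenizer treats "0or" and "2else" the same way, here is a minimal sketch. It is not the standard library's tokenize module: the pattern set, the `TOKEN_RE` name, and the `tokenize_sketch` helper are all hypothetical and heavily simplified. Each alternative either matches a complete literal or fails, so an incomplete prefix such as "0o" or "2e" is never consumed on its own.

```python
import re

# Hypothetical, simplified token patterns; NOT the real tokenize module.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>0[xX][0-9a-fA-F]+|0[oO][0-7]+|\d+(?:[eE][+-]?\d+)?)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[\[\](),])
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize_sketch(source):
    """Split source into (kind, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {source[pos]!r}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# "0o" is not a complete octal literal, so the regex falls back to "0".
print(tokenize_sketch("0or[]"))
# "2e" is not a complete exponent, so the regex falls back to "2".
print(tokenize_sketch("1if 2else 3"))
```

Under these assumed patterns both inputs split at the number/keyword boundary, which mirrors the `python -m tokenize` output quoted earlier in the thread.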
26.04.18 21:37, Serhiy Storchaka wrote:
I propose to change the Python syntax by adding a requirement that there should be a whitespace or delimiter between a numeric literal and the following keyword.
A new example was found recently (see https://bugs.python.org/issue43833).
    >>> [0x1for x in (1,2)]
    [31]
It is parsed as [0x1f or x in (1,2)] instead of [0x1 for x in (1,2)]. Since this code is clearly ambiguous, it makes more sense to emit a SyntaxWarning if there is no space between a number and an identifier.
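For comparison, the two readings can be written out with explicit spaces. This is only an illustrative sketch; the variable names `x_values`, `as_parsed`, and `as_intended` are made up for the example.

```python
x_values = (1, 2)

# Reading chosen by the tokenizer: 0x1f is 31, which is truthy, so `or`
# short-circuits and the right-hand operand (`x in x_values`) is never
# evaluated. The unbound name `x` is therefore never looked up.
as_parsed = [0x1f or x in x_values]

# Reading a human might expect: a list comprehension yielding 0x1 per item.
as_intended = [0x1 for x in x_values]

print(as_parsed)    # [31]
print(as_intended)  # [1, 1]
```

The first form is a one-element list display, the second a comprehension, so the missing space changes both the value and the kind of expression.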
On Tue, Apr 13, 2021 at 12:55 PM Serhiy Storchaka <storchaka@gmail.com> wrote:
Since this code is clearly ambiguous, it makes more sense to emit a SyntaxWarning if there is no space between a number and an identifier.
I would totally make that a SyntaxError, and backwards compatibility be damned.
--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here? <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>)
It would be useful to first estimate how many projects would be broken by such an incompatible change (stricter syntax).
Inada-san wrote https://github.com/methane/notes/blob/master/2020/wchar-cache/download_sdist... to download source files using the https://hugovk.github.io/top-pypi-packages/ API (top 4000 PyPI projects).
Victor
On Tue, Apr 13, 2021 at 10:59 PM Guido van Rossum <guido@python.org> wrote:
I would totally make that a SyntaxError, and backwards compatibility be damned.
--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OU3USHVM...
Code of Conduct: http://python.org/psf/codeofconduct/
-- Night gathers, and now my watch begins. It shall not end until my death.
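One rough way to approach the estimate suggested above is to scan downloaded source files for a numeric literal immediately followed by a keyword. The sketch below is only a heuristic under stated assumptions: `KEYWORDS`, `ADJACENT_RE`, and `find_adjacent_keywords` are hypothetical names, the pattern covers only integer literals, and it does not skip string literals or comments, so it can report false positives and miss float literals.

```python
import re

# Keywords that can legally follow a number in an expression.
KEYWORDS = ("and", "else", "for", "if", "in", "is", "not", "or")

# Heuristic pattern (an assumption of this sketch, not CPython's rule):
# an integer literal directly followed by one of the keywords above.
ADJACENT_RE = re.compile(
    r"(?<![\w.])"  # not in the middle of a name or another number
    r"(0[xX][0-9a-fA-F_]+|0[oO][0-7_]+|0[bB][01_]+|\d[\d_]*)"
    r"(" + "|".join(KEYWORDS) + r")\b"
)


def find_adjacent_keywords(source):
    """Return (line number, number text, keyword) for each suspicious spot."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in ADJACENT_RE.finditer(line):
            hits.append((lineno, match.group(1), match.group(2)))
    return hits


sample = "a = 1if b else 2\nb = [0x1for x in (1, 2)]\nc = 10 or 2\nname1or = 3\n"
for hit in find_adjacent_keywords(sample):
    print(hit)
```

A more careful survey would work on real tokens rather than raw text, but a quick pass like this may be enough to judge whether the pattern is common in the wild.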
Also, would it be possible to enhance the tokenizer to report a SyntaxWarning, rather than a SyntaxError?
Victor
This isn't a "professional" or probably even "valid" use case for Python, but one area where this behavior is heavily used is code golf. For those not familiar, code golf is a type of puzzle where the objective is to complete a set of requirements in as few source code characters as possible. Among mainstream languages, Python is surprisingly good at code golf.
This is just for fun puzzle solving and not a reason to keep or change syntax in any particular way; in fact, succeeding at code golf may even be loosely correlated with bad syntax rules, as puzzles tend to be completed in one of the least readable ways a language can be written. But at least be aware that if this becomes forbidden syntax, that's likely the most affected area of Python usage.
But it also made me think it could affect code minifiers, which are apparently a real use case in Python: https://github.com/dflook/python-minifier
(It seems this minifier doesn't actually remove the spaces between numbers and keywords where it could, but it is a fascinating niche of Python I did not know about.)
Regards
Damian (he/him)
On Wed, Apr 14, 2021 at 7:56 AM Victor Stinner <vstinner@python.org> wrote:
Also, would it be possible to enhance the tokenizer to report a SyntaxWarning, rather than a SyntaxError?
Victor
On Apr 13, 2021, at 12:52, Serhiy Storchaka <storchaka@gmail.com> wrote:
A new example was found recently (see https://bugs.python.org/issue43833).
    >>> [0x1for x in (1,2)]
    [31]
It is parsed as [0x1f or x in (1,2)] instead of [0x1 for x in (1,2)].
That’s a wonderfully terrible example! Who’s maintaining the list? :D -Barry
I feel like all of these examples, if found in the wild, are far more likely to be uncaught bugs than programmer intent. Being strict about spaces (or parens, brackets, etc. in other contexts) around numbers is much more straightforward than a number of edge cases where it is not obvious what will happen.
On Tue, Apr 13, 2021, 6:24 PM Barry Warsaw <barry@python.org> wrote:
That’s a wonderfully terrible example! Who’s maintaining the list? :D
-Barry
participants (8)
- Barry Warsaw
- Damian Shaw
- David Mertz
- Greg Ewing
- Guido van Rossum
- Lukasz Langa
- Serhiy Storchaka
- Victor Stinner