[Python-Dev] Put token information in one place

Serhiy Storchaka storchaka at gmail.com
Wed May 31 07:00:22 EDT 2017


Currently, when you add a new token, you need to change several files
(a short sketch of the duplication follows the list):

* Include/token.h
* _PyParser_TokenNames in Parser/tokenizer.c
* PyToken_OneChar(), PyToken_TwoChars() or PyToken_ThreeChars() in 
Parser/tokenizer.c
* Lib/token.py (generated from Include/token.h)
* EXACT_TOKEN_TYPES in Lib/tokenize.py
* Operator, Bracket or Special in Lib/tokenize.py
* Doc/library/token.rst
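
For example, a token like '->' currently appears in several
hand-maintained forms. The sketch below shows only the Python side and
elides everything but one entry (the value 51 and the single-entry
dicts are illustrative); the #define in Include/token.h and the
_PyParser_TokenNames entry are the C-side counterparts:

    # Lib/token.py (kept in sync with Include/token.h by hand)
    RARROW = 51                         # must match the #define in token.h
    tok_name = {RARROW: 'RARROW'}       # the real module builds this for all tokens

    # Lib/tokenize.py (the same token again, keyed by its string)
    EXACT_TOKEN_TYPES = {'->': RARROW}  # the real dict covers every exact token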

It is possible to generate all this information from a single source.
The patch proposed in [1] uses Lib/token.py as the initial source. But
maybe Lib/token.py itself should be generated from some file in a more
general format? Some of the information can be derived from
Grammar/Grammar, but not all of it. A mapping between token strings
('(' or '>=') and names (LPAR, GREATEREQUAL) is also needed. Could this
be added to Grammar/Grammar or to a new file?
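
A minimal sketch of what such a single source and a consumer of it
might look like (the format and the helper below are hypothetical, not
taken from the patch in [1]):

    # Hypothetical single source: one token per line, the name first,
    # then the literal string, if the token has one.
    TOKEN_DEFS = """\
    LPAR          '('
    RPAR          ')'
    GREATEREQUAL  '>='
    NAME
    NUMBER
    """

    def parse_token_defs(text):
        """Yield (name, string or None) pairs from the format above."""
        for line in text.splitlines():
            parts = line.split()
            if not parts:
                continue
            name = parts[0]
            string = parts[1].strip("'") if len(parts) > 1 else None
            yield name, string

    # From these pairs a build step could emit Include/token.h,
    # Lib/token.py, EXACT_TOKEN_TYPES in Lib/tokenize.py and the table
    # in Doc/library/token.rst.
    for number, (name, string) in enumerate(parse_token_defs(TOKEN_DEFS)):
        print("#define %-16s %d" % (name, number))   # token.h-style line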

There is a related problem: the tokenize module uses three additional
tokens (COMMENT, NL and ENCODING) that are not used by the C tokenizer.
It modifies the content of the token module after importing it, which
is not good. [2] One solution is to make a copy of tok_name in tokenize
before modifying it, but that doesn't work, because third-party code
looks up tokenize constants in token.tok_name. Another solution is to
add the tokenize-specific constants to the token module itself. Is it
good to expose in the token module tokens that are not used by the C
tokenizer?
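
Roughly, the current code in Lib/tokenize.py does something like the
following (simplified), mutating the tok_name dict that the token
module owns, because tok_name is the same dict object in both modules:

    from token import tok_name, N_TOKENS

    COMMENT = N_TOKENS
    NL = N_TOKENS + 1
    ENCODING = N_TOKENS + 2
    tok_name[COMMENT] = 'COMMENT'    # also visible as token.tok_name[...]
    tok_name[NL] = 'NL'
    tok_name[ENCODING] = 'ENCODING'

Making a private copy first (tok_name = dict(token.tok_name)) would
leave the token module untouched, but then code that looks up, say,
tokenize.NL in token.tok_name would fail with a KeyError.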

Non-terminal symbols are already generated automatically: Lib/symbol.py
from Include/graminit.h, and Include/graminit.h and Python/graminit.c
from Grammar/Grammar by Parser/pgen. Is it worth generating
Lib/symbol.py with pgen too? Can pgen be implemented in Python?
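
The Lib/symbol.py step is already just a small piece of Python that
scrapes the #define lines out of the generated header (if I remember
correctly it reuses the _main() helper in Lib/token.py). A simplified
sketch of that kind of scraping:

    import re

    DEFINE_RE = re.compile(r'#define\s+(\w+)\s+(\d+)')

    def read_defines(path):
        """Return {name: number} for the #define lines of a C header
        such as Include/graminit.h or Include/token.h."""
        defines = {}
        with open(path) as f:
            for line in f:
                m = DEFINE_RE.match(line)
                if m:
                    defines[m.group(1)] = int(m.group(2))
        return defines

    # Emitting symbol.py-style assignments would then be e.g.:
    #     for name, value in read_defines('Include/graminit.h').items():
    #         print('%s = %d' % (name, value))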

See also a similar issue for opcodes. [3]

[1] https://bugs.python.org/issue30455
[2] https://bugs.python.org/issue25324
[3] https://bugs.python.org/issue17861
