extending ast.parse with some lib2to3.pytree features
With lib2to3 going away (https://bugs.python.org/issue40360), it seems to me that some of its functionality for handling "whitespace" can be fairly easily added to the ast module. (By "whitespace", I mean things like indent, dedent, comments, backslash; and also the ability to manipulate the encoded bytes in the original source.)

Off the top of my head, I know of the following projects that use lib2to3 or similar to access the "whitespace" in the parse tree and will need a new solution: yapf, black, mypy, pytype, kythe. (If they don't use lib2to3, they need to maintain a custom parser that changes with each release of Python, so my proposal would potentially help them.) Are there other projects that need access to the parse tree and "whitespace"?

I propose implementing an optional pass over the parse tree that records lib2to3's "prefix" with each leaf node. The interface would be something like:

    def detect_encoding(source: bytes) -> str
    def parse_with_whitespace(source: bytes, encoding: str, filename: str) -> ast.Module
    def unparse_bytes(tree: ast.Module) -> bytes
    def unparse_str(tree: ast.Module) -> str
    # Various convenience functions/properties, similar to pytree.next_sibling etc.

parse_with_whitespace() calls ast.parse(), then does a pass over the parse tree, adding to the leaf nodes:

    prefix: str           # whitespace and comments preceding token in the input
    pieces: List          # see below
    col_byte_offset: int  # start byte offset within line
    src_byte_offset: int  # start byte offset within source

The "pieces" field is intended to handle things like:

    x = 'abc' \
        "def"
    y = (f'abc{x}'  # comment
         "def")

Each "piece" would include:
- byte offset from the beginning of the token (negative for a prefix piece)
- detailed type (single-quote string, double-quote string, format-string, int, float, etc.)
- source bytes
- decoded value (Unicode str)

The ast.Module class would also be extended with additional attributes that apply to the entire source, such as the encoding.
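To make the "prefix" idea concrete, here is a minimal sketch of how a prefix could be computed for each token. It assumes the stdlib tokenize module does the extra pass, which the proposal does not specify, and the name tokens_with_prefix is hypothetical. It works on tokens rather than on ast leaf nodes, so attaching the result to the tree is left out.

```python
import io
import tokenize
from typing import List, Tuple

# Token types treated as "whitespace" and folded into the next token's prefix.
_TRIVIA = {tokenize.ENCODING, tokenize.COMMENT, tokenize.NL,
           tokenize.INDENT, tokenize.DEDENT}


def tokens_with_prefix(source: bytes) -> List[Tuple[str, str]]:
    """Return (prefix, token_text) pairs for every significant token.

    Hypothetical helper; not part of the ast module or of the proposal.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    text = source.decode(encoding)

    # Character offset at which each line starts (tokenize rows are 1-based).
    line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
    line_starts.append(len(text))  # sentinel for ENDMARKER past the last line

    def offset(pos: Tuple[int, int]) -> int:
        row, col = pos
        # Clamp: the implicit NEWLINE of a file without a trailing newline
        # is reported one column past the end of the text.
        return min(line_starts[row - 1] + col, len(text))

    pairs = []
    last_end = 0
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type in _TRIVIA:
            continue
        start, end = offset(tok.start), offset(tok.end)
        # Everything between the previous significant token and this one
        # (spaces, comments, backslash continuations) becomes the prefix.
        pairs.append((text[last_end:start], text[start:end]))
        last_end = end
    return pairs
```

With this split, joining every prefix and token text in order gives back the decoded source, which is the property an unparse_str() along the lines above would rely on.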
All of this is quite fiddly, but I already have some code for dealing with the conversions between byte and string offsets, so I don't anticipate a huge amount of work. (The design is also inherently a bit inefficient; but I don't want to get involved with the internals of compile().)

A related item: "Parser module in the stdlib": https://mail.python.org/archives/list/python-dev@python.org/thread/RHZ6JOEXJ...
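On the byte/string offset conversions mentioned above: the following is only a sketch of the basic idea, not the code referred to in the message. The function names are hypothetical, and it assumes each line is handled separately and that the encoding does not emit a byte-order mark on every encode() call (true for utf-8 and latin-1, not for utf-16 or utf-8-sig).

```python
def char_col_to_byte_col(line: str, char_col: int, encoding: str = "utf-8") -> int:
    """Byte offset, within the encoded line, of the character at char_col."""
    # Encode only the text before the column and count the resulting bytes.
    return len(line[:char_col].encode(encoding))


def byte_col_to_char_col(line: bytes, byte_col: int, encoding: str = "utf-8") -> int:
    """Character offset, within the decoded line, of the byte at byte_col.

    byte_col must fall on a character boundary; otherwise decode() raises
    UnicodeDecodeError.
    """
    return len(line[:byte_col].decode(encoding))
```

This matters because the col_offset values that ast already records are UTF-8 byte offsets, while tokenize reports character columns, so any pass that combines the two has to convert between them.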
participants (1)
-
Peter Ludemann