extending ast.parse with some lib2to3.pytree features - Python-ideas

8 Jul 2020

With lib2to3 going away (https://bugs.python.org/issue40360), it seems to
me that some of its functionality for handling "whitespace" can be fairly
easily added to the ast module. (By "whitespace", I mean things like
indent, dedent, comments, backslash; and also the ability to manipulate the
encoded bytes in the original source.)

Off the top of my head, I know of the following projects that use lib2to3
or similar to access the "whitespace" in the parse tree and will need a new
solution:
yapf, black, mypy, pytype, kythe.
(If they don't use lib2to3, they need to maintain a custom parser that
changes with each release of Python, so my proposal would potentially help
them.)
Are there other projects that need access to the parse tree and
"whitespace"?

I propose implementing an optional pass over the parse tree that records
lib2to3's "prefix" with each leaf node. The interface would be something
like:

  def detect_encoding(source: bytes) -> str

  def parse_with_whitespace(source: bytes, encoding: str, filename: str) -> ast.Module

  def unparse_bytes(tree: ast.Module) -> bytes

  def unparse_str(tree: ast.Module) -> str

  # Various convenience functions/properties, similar to pytree.next_sibling etc.
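
As a rough illustration of the first function: detect_encoding() could
plausibly be a thin wrapper around the existing tokenize.detect_encoding().
This is only a sketch under that assumption, not a settled part of the
proposal:

```python
import io
import tokenize


def detect_encoding(source: bytes) -> str:
    """Return the source encoding, following PEP 263 cookies and the UTF-8 BOM.

    Sketch only: delegates to tokenize.detect_encoding(), which expects a
    readline callable rather than a bytes object.
    """
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding
```

Note that tokenize normalizes some names, so a "latin-1" cookie is reported
as "iso-8859-1".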

parse_with_whitespace() calls ast.parse(), then makes a pass over the parse
tree, adding these attributes to each leaf node:
    prefix: str  # whitespace and comments preceding token in the input
    pieces: List  # see below
    col_byte_offset: int  # start byte offset within line
    src_byte_offset: int  # start byte offset within source
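
To make "prefix" concrete, here is a minimal sketch that computes it from
the tokenize module rather than from the AST. The function name
token_prefixes is hypothetical, and treating comments, newlines, indents and
dedents as part of the next token's prefix is my assumption about how the
lib2to3 behaviour would carry over:

```python
import io
import tokenize
from typing import List, Tuple

# Token types that are folded into the following token's prefix.
_PREFIX_TYPES = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def token_prefixes(source: str) -> List[Tuple[str, str]]:
    """Return (prefix, token_text) pairs for each significant token."""
    # Character offset of the start of each line, so (row, col) -> offset.
    line_starts = [0]
    for line in io.StringIO(source).readlines():
        line_starts.append(line_starts[-1] + len(line))

    def offset(pos: Tuple[int, int]) -> int:
        row, col = pos
        return line_starts[row - 1] + col

    result = []
    prev_end = 0  # offset just past the previous significant token
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _PREFIX_TYPES:
            continue
        start = offset(tok.start)
        # Everything between the two tokens: whitespace, comments, backslashes.
        result.append((source[prev_end:start], tok.string))
        prev_end = offset(tok.end)
    return result
```

Concatenating each prefix with its token text reproduces the source up to
the end of the last significant token, which is the property that unparsing
would rely on.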

The "pieces" field is intended to handle things like:
   x = 'abc' \
       "def"
   y = (f'abc{x}'  # comment
        "def")

Each "piece" would include:
  - byte offset from the beginning of the token (negative for a prefix
piece)
  - detailed type (single-quote string, double-quote string, format-string,
int, float, etc.)
  - source bytes
  - decoded value (Unicode str)
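
A "piece" might look something like the following. The Piece class, its
field names, and the string_pieces helper are all hypothetical; the sketch
handles only plain (non-f, non-bytes) string literals and treats every
string token in the input as belonging to one concatenated value:

```python
import ast
import io
import tokenize
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Piece:
    offset: int    # byte offset from the beginning of the (first) token
    kind: str      # detailed type, e.g. "single-quote string"
    source: bytes  # source bytes of this piece
    value: str     # decoded value


def string_pieces(source: bytes, encoding: str = "utf-8") -> List[Piece]:
    """Split implicitly concatenated string literals into pieces."""
    text = source.decode(encoding)
    line_starts = [0]
    for line in io.StringIO(text).readlines():
        line_starts.append(line_starts[-1] + len(line))

    def byte_offset(pos: Tuple[int, int]) -> int:
        row, col = pos
        # Re-encode the text before the position to count bytes, not chars.
        return len(text[: line_starts[row - 1] + col].encode(encoding))

    pieces: List[Piece] = []
    first = None
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type != tokenize.STRING:
            continue
        start = byte_offset(tok.start)
        if first is None:
            first = start
        quote = tok.string.lstrip("rRbBuUfF")[0]
        kind = "single-quote string" if quote == "'" else "double-quote string"
        pieces.append(
            Piece(start - first, kind, tok.string.encode(encoding),
                  ast.literal_eval(tok.string))
        )
    return pieces
```

For the first example above (the backslash continuation), this yields two
pieces, with the second one's offset counting the bytes of the backslash,
newline and indentation in between.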

The ast.Module class would also be extended with additional attributes that
apply to the entire source, such as the encoding.

All of this is quite fiddly, but I already have some code for dealing with
the conversions between byte and string offsets, so I don't anticipate a
huge amount of work. (The design is also inherently a bit inefficient; but
I don't want to get involved with the internals of compile().)
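
To give a flavour of the byte/string offset conversions: the ast module
already reports col_offset as a UTF-8 byte offset within the line, so
src_byte_offset and a character column can be derived from it. A minimal
sketch, assuming UTF-8 source and looking only at ast.Name nodes (the
function name name_offsets is made up for illustration):

```python
import ast
from typing import List, Tuple


def name_offsets(source: bytes) -> List[Tuple[str, int, int]]:
    """Return (identifier, src_byte_offset, str_col) for every ast.Name."""
    lines = source.splitlines(keepends=True)
    # Byte offset of the start of each line within the whole source.
    line_starts = [0]
    for line in lines:
        line_starts.append(line_starts[-1] + len(line))

    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            line = lines[node.lineno - 1]
            src_byte_offset = line_starts[node.lineno - 1] + node.col_offset
            # Decode the bytes before the node to get a character column.
            str_col = len(line[: node.col_offset].decode("utf-8"))
            out.append((node.id, src_byte_offset, str_col))
    return sorted(out, key=lambda item: item[1])
```

The two offsets differ as soon as a line contains non-ASCII characters
before the node, which is where most of the fiddliness comes from.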

A related item: "Parser module in the stdlib":
https://mail.python.org/archives/list/python-dev@python.org/thread/RHZ6JOEXJ...


Peter Ludemann
