Backward incompatible change about docstring AST

Hi, all.

There is a design discussion that is a deferred blocker for 3.7.
https://bugs.python.org/issue32911

## Background

A year ago, I moved the docstring in the AST from the statement list to a field of the module, class, and function nodes.
https://bugs.python.org/issue29463

Without this change, AST-level constant folding was complicated, because "foo" can be a docstring but "fo" + "o" can't be. The change also simplified some other edge cases. For example, a future import must be at the top of the module, but a docstring can come before it. A docstring is very different from other expressions and statements.

Of course, this change was backward incompatible. Tools that read or write docstrings via the AST are broken by it. For example, it broke PyFlakes, which has already been fixed.
https://github.com/PyCQA/pyflakes/pull/273

Since the AST doesn't guarantee backward compatibility, we can change it when that is reasonable.

Last week, Mark Shannon reported an issue about this backward incompatibility. As he said, this change lost the lineno and column of the docstring from the AST.
https://bugs.python.org/issue32911#msg312567

## Design discussion

As he said, there are three options:
https://bugs.python.org/issue32911#msg312625
Option 1 is backward compatible for reading docstrings. But for writing, it's not DRY and there is no single source of truth (SSOT): the docstring has two sources. For example: `ast.Module([ast.Str("spam")], docstring="egg")`

Option 2 is worth considering. I tried to implement this idea by adding a `DocString` statement node to the AST.
https://github.com/python/cpython/pull/5927/files

While it looks like a large change, most of it reverts the earlier AST changes, so the result is closer to the 3.6 codebase (test_ast in particular is very close to 3.6).

In this PR, `ast.Module([ast.Str("spam")])` doesn't have a docstring, for simplicity. So it's backward incompatible for both reading and writing docstrings too. But it keeps the lineno and column of the docstring in the AST.

Option 3 is the most conservative, because 3.7b2 has already been cut and some tools already support 3.7.

I prefer 2 or 3. If we take 3, I don't want to do 2 in 3.8. One backward incompatible change is better than two.

Any thoughts?

-- INADA Naoki <songofacandy@gmail.com>
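For readers following along, here is a minimal sketch of the "foo" versus "fo" + "o" distinction described above. It only uses `ast.parse` and `ast.get_docstring`, which hide where the docstring is stored in the tree; the helper name `module_docstring` is made up for this illustration and is not part of any proposal in this thread.

```python
import ast


def module_docstring(source):
    """Return the module docstring of `source`, or None if there is none."""
    # ast.get_docstring() works the same whether the docstring is stored
    # as the first statement or as a separate field of the node.
    return ast.get_docstring(ast.parse(source))


# A plain string literal in first position is a docstring.
literal = module_docstring('"foo"\nx = 1\n')

# A constant expression with the same value is not a docstring,
# even though constant folding could reduce it to "foo".
folded = module_docstring('"fo" + "o"\nx = 1\n')

print(literal)
print(folded)
```

Tools that only need the docstring text can go through `ast.get_docstring` and stay independent of the representation; tools that need its position cannot, which is the problem discussed here.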

27.02.18 15:37, INADA Naoki wrote:
Other examples:

coveragepy: https://bitbucket.org/ned/coveragepy/commits/99176232199b
pytest: https://github.com/pytest-dev/pytest/pull/2870
> Last week, Mark Shannon reported an issue about this backward incompatibility. As he said, this change lost the lineno and column of the docstring from the AST.
While losing lineno and column is a loss, it is not so large. There are existing issues with docstring position.

1. CPython and PyPy set different positions for multiline strings. PyPy sets the position of the start of the string, but CPython sets the position of the end of the string. A program that utilizes the docstring position needs to handle both of these cases.

2. Usually the position of the docstring is used for determining the absolute position of some fragments in the docstring (for example doctests). But since the literal string can contain \n and escaped newlines, and this information is lost in the AST, the position of the fragment can be determined accurately. This is just a best effort approximation.

3. You can determine an approximate position of the docstring by the positions of preceding or following nodes.
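As a rough illustration of points 1 and 3, the sketch below bounds a docstring's position by the neighbouring nodes. It assumes an interpreter where the docstring appears as the first statement of the function body (not the 3.7 beta layout discussed in this thread). The line reported for the docstring node itself depends on the implementation and version, so the code prints it without relying on it.

```python
import ast

source = (
    "def f():\n"
    '    """First line.\n'
    "\n"
    "    >>> f()\n"
    '    """\n'
    "    return None\n"
)

func = ast.parse(source).body[0]
doc_node = func.body[0]   # assumption: the docstring is the first statement
next_node = func.body[1]  # the `return` statement after the docstring

# Implementation-dependent: may be the first or the last line of the literal.
print("docstring node lineno:", doc_node.lineno)

# Portable bounds: the docstring lies after the `def` line and before
# the first line of the following statement.
lower_bound = func.lineno
upper_bound = next_node.lineno
print("docstring lies between lines", lower_bound, "and", upper_bound)
```

The bounds are only an approximation: they say nothing about which source line a given fragment inside the docstring (such as a doctest) comes from.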

On 2/27/2018 9:32 AM, Serhiy Storchaka wrote:
You obviously meant 'cannot be determined accurately', because, for reasons including the ones you gave, the relation between string literals in code and the resulting string objects is many-to-one (as with int literals).
-- Terry Jan Reedy

28.02.18 00:31, Terry Reedy wrote:
Yes, thank you for the correction.

The reason is that the relation between lines of source code and lines of the resulting string object is not simple. The string literal 'a\nb\nc' takes one line in the source code but produces a three-line string object. On the other hand, the string literal

    '''\
    abc\
    '''

takes three lines in the source code but produces a single-line string object. And it is not so rare for a docstring to start with an escaped newline.
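A small sketch of the two cases above: it counts physical source lines against lines in the evaluated string. The helper name `source_and_value_lines` is invented for this illustration.

```python
import ast

# One physical source line; the \n escapes become real newlines in the value.
one_line_source = r"'a\nb\nc'"

# Three physical source lines; each trailing backslash escapes the newline,
# so no newline ends up in the value.
three_line_source = "'''\\\nabc\\\n'''"


def source_and_value_lines(source):
    """Return (lines in the source literal, lines in the resulting string)."""
    value = ast.literal_eval(source)
    return len(source.splitlines()), len(value.splitlines())


print(source_and_value_lines(one_line_source))    # (1, 3)
print(source_and_value_lines(three_line_source))  # (3, 1)
```

Because both mappings exist, a line offset inside the string value cannot be converted to a source line without re-reading the original literal.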

Hi,

On 27 February 2018 at 15:32, Serhiy Storchaka <storchaka@gmail.com> wrote:
If that's true it's a PyPy bug.
https://bitbucket.org/pypy/pypy/issues/2767/docstring-position-in-the-ast

A bientôt,
Armin.

On 27/02/18 13:37, INADA Naoki wrote:
The AST module does make some guarantees. The general advice for anyone wanting to do bytecode generation is "don't generate bytecodes directly, use the AST module." However, as long as the AST -> bytecode conversion remains the same, I think it is OK to change source -> AST conversion.
This is my preferred option now.
I agree. Whatever we do, we should stick with it. Cheers, Mark.

Participants (6): Armin Rigo, Brett Cannon, INADA Naoki, Mark Shannon, Serhiy Storchaka, Terry Reedy