[Python-ideas] Proposal: Complex comprehensions containing statements

21 Feb 2020

This is a proposal for a new syntax where a comprehension is written as the appropriate brackets containing a loop which can contain arbitrary statements.

Here are some simple examples. Instead of:

    [
        f(x)
        for y in z
        for x in y
        if g(x)
    ]

one may write:

    [
        for y in z:
            for x in y:
                if g(x):
                    f(x)
    ]

Instead of:

    lst = []
    for x in y:
        if cond(x):
            break
        z = f(x)
        lst.append(z * 2)

one may write:

    lst = [
        for x in y:
            if cond(x):
                break
            z = f(x)
            yield z * 2
    ]

Instead of:

    [
        {k: v for k, v in foo}
        for foo in bar
    ]

one may write:

    [
        for foo in bar:
            {for k, v in foo: k: v}
    ]

## Specification

A list/set/dict comprehension or generator expression is written as the appropriate brackets containing a `for` or `while` loop.

In the general case, some expressions are prefixed with `yield`, and these become the values of the comprehension, as in a generator function.

If the comprehension contains exactly one expression statement at any level of nesting, i.e. if there is only one place where a `yield` can be placed at the start of a statement, then `yield` is not required and the expression is implicitly yielded. In particular this means that any existing comprehension translated into the new style doesn't require `yield`.

If the comprehension doesn't contain exactly one expression statement and doesn't contain a `yield`, it's a SyntaxError.
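
One way to picture the implicit-yield rule is to count the expression statements in the loop. The sketch below does this with the standard `ast` module on loops written as ordinary Python. `count_expression_statements` is a hypothetical helper; a real implementation would live in the parser and would also need to handle nested comprehensions and dictionary pairs.

```python
import ast


def count_expression_statements(loop_source):
    """Count statements that consist of a bare expression."""
    tree = ast.parse(loop_source)
    return sum(isinstance(node, ast.Expr) for node in ast.walk(tree))


one_place = """
for y in z:
    for x in y:
        if g(x):
            f(x)
"""

two_places = """
for cell in row:
    try:
        f(cell)
    except ValueError:
        0
"""

print(count_expression_statements(one_place))   # 1
print(count_expression_statements(two_places))  # 2
```

Under the proposed rule the first loop would yield `f(x)` implicitly, while the second would need an explicit `yield` in both places.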

### Dictionary comprehensions

For dictionary comprehensions, a `key: value` pair is allowed as its own pseudo-statement or in a yield. It's not a real expression and cannot appear inside other expressions.

This can potentially be confused with variable type annotations with no assigned value, e.g. `x: int`. But we can essentially apply the same rule as other comprehensions: either use `yield`, or only have one place where a `yield` could be added in front of a statement. So if there is only one pair `x: y` we try to implicitly yield that. The only way this could be misinterpreted is if a user declared the type of exactly one expression and completely forgot to give their comprehension elements, and the program would almost certainly fail spectacularly.
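
In today's Python a similar effect can be had by yielding 2-tuples and passing them to `dict()`, and a bare annotation really is parsed as its own kind of statement, which is the source of the ambiguity discussed above. The helper `_pairs` and the sample data are made up for this sketch.

```python
import ast


def _pairs(foo):
    # Stand-in for the proposed {for k, v in foo: k: v}.
    for k, v in foo:
        yield k, v


foo = [("a", 1), ("b", 2)]
result = dict(_pairs(foo))
print(result)  # {'a': 1, 'b': 2}

# `x: int` on its own line is an annotation without a value.
statement = ast.parse("x: int").body[0]
print(type(statement).__name__)  # AnnAssign
```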

### Whitespace

If placing the loop on a single line would be valid syntax outside a comprehension (i.e. it just contains a simple statement) then we call this an *inline* comprehension. It can be inserted in the same line(s) as other code and formatted however the writer likes - there are no concerns about whitespace.
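
Whether a loop qualifies as inline can be approximated today by asking the parser if the one-line form is a valid statement. This is only a sketch: `could_be_inline` is a hypothetical name, and it assumes the caller passes a single line of source.

```python
import ast


def could_be_inline(loop_source):
    """Return True if the one-line loop parses as a normal statement."""
    try:
        ast.parse(loop_source)
    except SyntaxError:
        return False
    return True


print(could_be_inline("for x in y: f(x)"))            # True
print(could_be_inline("for x in y: if x > 0: f(x)"))  # False
```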

For a more complex comprehension, the loop must start and end with a newline, i.e. the lines containing the loop cannot contain any tokens from outside, including the enclosing brackets. For example, this is allowed:

    foo = [
        for x in y:
            if x > 0:
                f(x)
    ]

but this is not:

    foo = [for x in y:
               if x > 0:
                   f(x)]

This ensures that code is readable even at a quick glance. The eyes can quickly find where the loop starts and distinguish the embedded statements from the rest of the enclosing expression.

Furthermore, it's easy to copy and paste entire lines to move them around, whereas refactoring the invalid example above without specific tools would be annoying and error-prone. It also makes it easy to adjust code outside the comprehension (e.g. rename `foo` to something longer) without messing up indentation and alignment.

Inside the loop, the rules for indentation and such are the same as anywhere else. The syntax of the loop is valid only if it's also valid as a normal loop outside any expression. The body of the loop must be more indented than the for/while keyword that starts the loop.

### Variable scope

Since the new comprehensions look like normal loops, perhaps they should behave like them again (as list comprehensions did in Python 2), including executing in the same scope and 'leaking' the iteration variable(s). Assignments via the walrus operator already affect the outer scope; only the iteration variable currently behaves differently. My understanding is that this is influenced by the fact that there is little reason to use the value of the iteration variable after a list comprehension completes, since it will always be the last value in the iterable. But since the new syntax allows `break`, the value may become useful again.
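
The current behaviour described above can be checked directly in Python 3.8 or later. This snippet uses made-up names (`values`, `last`, `n`, `m`) and shows that a walrus target survives the comprehension, the iteration variable does not, and a normal loop with `break` leaves a useful value behind.

```python
values = [(last := n) * n for n in range(4)]
print(values)  # [0, 1, 4, 9]
print(last)    # 3, the walrus target is bound in the enclosing scope

try:
    n
except NameError:
    iteration_variable_leaked = False
else:
    iteration_variable_leaked = True
print(iteration_variable_leaked)  # False

# A normal loop leaks its variable, and after a break it records
# where the loop stopped.
for m in range(10):
    if m * m > 10:
        break
print(m)  # 4
```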

I don't know what the right approach is here and I imagine it can generate plenty of debate. Given that this whole proposal is already controversial and likely to be rejected this may not be the best place to start discussion. But maybe it is, I don't know.

## Benefits/comparison to current methods

### Uniform syntax

The new comprehensions just look like normal loops in brackets, or generator functions. This should make them easier for beginners to learn than the old comprehensions.

A particular concept that's easier to learn is comprehensions that contain multiple loops. Consider this comprehension over a nested list:

    [
        f(cell)
        for row in matrix
        for cell in row
    ]

For beginners this can easily be confusing, [and sometimes for experienced coders too](https://mail.python.org/archives/list/python-ideas@python.org/message/BX7LWU...). Yes, there's a rule that one can learn, but putting it in reverse also seems logical, perhaps even more so:

    [
        f(cell)
        for cell in row
        for row in matrix
    ]

Now the comprehension is 'consistently backwards', it reads more like English, and the usage of `cell` is right next to its definition. But of course that order is wrong...unless we want a nested list comprehension that produces a new nested list:

    [
        [
            f(cell)
            for cell in row
        ]
        for row in matrix
    ]

Again, it's not hard for an experienced coder to understand this, but for a beginner grappling with new concepts this is not great. Now consider how the same two comprehensions would be written in the new syntax:

    [
        for row in matrix:
            for cell in row:
                f(cell)
    ]

    [
        for row in matrix:
            [
                for cell in row:
                    f(cell)
            ]
    ]
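
For reference, the two existing comprehension forms correspond to the loops below in current Python. The sample `matrix` and `f` are assumptions chosen only to make the snippet runnable.

```python
matrix = [[1, 2], [3, 4]]


def f(cell):
    # Placeholder transformation, assumed for illustration.
    return cell * 10


flat = [f(cell) for row in matrix for cell in row]
nested = [[f(cell) for cell in row] for row in matrix]

# Equivalent explicit loops: the `for` clauses keep their order.
flat_loop = []
for row in matrix:
    for cell in row:
        flat_loop.append(f(cell))

nested_loop = []
for row in matrix:
    new_row = []
    for cell in row:
        new_row.append(f(cell))
    nested_loop.append(new_row)

print(flat)    # [10, 20, 30, 40]
print(nested)  # [[10, 20], [30, 40]]
```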

### Power and flexibility

Comprehensions are great and I love using them. I want to be able to use them more often. I know I can solve any problem with a loop, but it's obvious that comprehensions are much nicer or we wouldn't need to have them at all. Compare this code:

    new_matrix = []
    for row in matrix:
        new_row = []
        for cell in row:
            try:
                new_row.append(f(cell))
            except ValueError:
                new_row.append(0)
        new_matrix.append(new_row)

with the solution using the new syntax:

    new_matrix = [
        for row in matrix: [
            for cell in row:
                try:
                    yield f(cell)
                except ValueError:
                    yield 0
        ]
    ]

It's immediately visually obvious that it's building a new nested list, there's much less syntax for me to parse, and the variable `new_row` has gone from appearing 4 times to 0!

There have been many requests to add some special syntax to comprehensions to make them a bit more powerful:

- [Is this PEP-able? "with" statement inside genexps / list comprehensions](https://mail.python.org/archives/list/python-ideas@python.org/thread/BUD46OE...)
- [Allowing breaks in generator expressions by overloading the while keyword](https://mail.python.org/archives/list/python-ideas@python.org/thread/6PEOE5Z...)
- [while conditional in list comprehension ??](https://mail.python.org/archives/list/python-ideas@python.org/thread/RYBBHV3...)

This would solve all such problems neatly.
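
For comparison, the `break` case from the start of this post can already be expressed with `itertools.takewhile`, at the cost of a lambda and an inverted condition. `f`, `cond` and `y` are placeholder definitions for this sketch.

```python
from itertools import takewhile


def f(x):
    # Placeholder transformation, assumed for illustration.
    return x + 1


def cond(x):
    # Placeholder stop condition, assumed for illustration.
    return x > 3


y = [1, 2, 3, 4, 5]

# Stop at the first x for which cond(x) is true, like `break`.
lst = [f(x) * 2 for x in takewhile(lambda x: not cond(x), y)]
print(lst)  # [4, 6, 8]
```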

### No trying to fit things in a single expression

The current syntax can only contain one expression in the body. This restriction makes it difficult to solve certain problems elegantly and creates an uncomfortable grey area where it's hard to decide between squeezing maybe a bit too much into an expression or doing things 'manually'. This can lead to analysis paralysis and disagreements between coders and reviewers. For example, which of the following is the best?

    clean = [
        line.strip()
        for line in lines
        if line.strip()
    ]

    stripped = [line.strip() for line in lines]
    clean = [line for line in stripped if line]

    clean = list(filter(None, map(str.strip, lines)))

    clean = []
    for line in lines:
        line = line.strip()
        if line:
            clean.append(line)

    def clean_lines():
        for line in lines:
            line = line.strip()
            if line:
                yield line

    clean = list(clean_lines())

You probably have a favourite, but it's very subjective and this kind of problem requires judgement depending on the situation. For example, I'd choose the first version in this case, but a different version if I had to worry about duplicating something more complex or expensive than `.strip()`. And again, there's an awkward middle ground where it's hard to decide whether I care enough about the duplication.

What about assignment expressions? We could do this:

    clean = [
        stripped
        for line in lines
        if (stripped := line.strip())
    ]

Like the nested loops, this is tricky to parse without experience. The execution order can be confusing and the variable is used away from where it's defined. Even if you like it, there are clearly many who don't. I think the fact that assignment expressions were a desired feature despite being so controversial is a symptom of this problem. It's the kind of thing that happens when we're stuck with the limitations of a single expression.
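
To make the execution order concrete, the assignment-expression version behaves like the loop below: the `if` clause runs first and binds `stripped`, and only then is the leading expression evaluated. The sample `lines` is invented for this sketch.

```python
lines = ["  spam ", "   ", "eggs\n"]

clean = [
    stripped
    for line in lines
    if (stripped := line.strip())
]

# The same steps written in the order they execute.
clean_loop = []
for line in lines:
    stripped = line.strip()
    if stripped:
        clean_loop.append(stripped)

print(clean)       # ['spam', 'eggs']
print(clean_loop)  # ['spam', 'eggs']
```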

The solution with the new syntax is:

    clean = [
        for line in lines:
            stripped = line.strip()
            if stripped:
                stripped
    ]

or if you'd like to use an assignment expression:

    clean = [
        for line in lines:
            if stripped := line.strip():
                stripped
    ]

I think both of these look great and are easily better than any of the other options. And I think the new syntax would be the clear winner in any similar situation - no careful judgement needed. This would become the one (and only one) obvious way to do it. The new syntax has the elegance of list comprehensions and the flexibility of multiple statements. It's completely scalable and works equally well from the simplest comprehension to big complicated constructions.

### Easy to change

I hate it when I've already written a list comprehension but a new requirement forces me to change it to, say, the `.append` version. It's a tedious refactoring involving brackets, colons, indentation, and moving things around. It also leaves me with a very unhelpful `git diff`. With the new syntax I can easily add logic as I please and get a nice simple diff.


Alex Hall