free variables in generator expressions

In the generator expression (x+1 for x in L), the name 'L' is not local to the expression (as opposed to 'x'). I will call such a name a 'free name', as I am not aware of an existing terminology. The following is about how to treat 'free names' inside generator expressions. I argue that these names should be bound to their values when the generator expression is created, and the rest of this email tries to provide arguments why this may be desirable. I am fully aware that I tend to think about things in a very skewed manner though, so I would be grateful for any rebuttal.

Recently I tried to implement a 'sieve' algorithm using generator expressions instead of lists. It wasn't the sieve of Eratosthenes, but I will use that as a simpler example (all of the code shown below is Python 2.5). Here is one way to implement the algorithm using list comprehensions (tuples could also be used, as the mutability of lists is not used):

    def list_sieve(n):
        "Generate all primes less than n"
        P = range(2, n)
        while True:
            # Yield the first element of P and sieve out its multiples
            for p in P:
                yield p
                P = [x for x in P if x % p]
                break
            else:
                # If P was empty then we're done
                return
    >>> list(list_sieve(20))
    [2, 3, 5, 7, 11, 13, 17, 19]
So that's OK. Then it occurred to me that I didn't need to keep creating all these lists; so I decided, without further thinking, to switch to generator expressions, as they are supposed to abstract the notion of iterable object (and in the function above I'm only using the 'iterable' properties of lists; any other iterable should do). So I came up with this:

    def incorrect_gen_sieve(n):
        "Unsuccessfully attempt to generate all primes less than n"
        # Change range to xrange
        P = xrange(2, n)
        while True:
            for p in P:
                yield p
                # Change list comprehension to generator expression
                P = (x for x in P if x % p)
                break
            else:
                return
    >>> list(incorrect_gen_sieve(20))
    [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Ouch. Looking back at the code, I realised that this was due to the fact that the names 'p' and 'P' in the generator expression are not local to it, so subsequent bindings of these names affect the expression. I can't really analyse further than this; my head starts spinning if I try :). So I wrapped the generator expression in a lambda function to make sure that the names p and P inside it were bound to what I intended them to be:

    def gen_sieve(n):
        "Generate all primes less than n, but without tables!"
        P = xrange(2, n)
        while True:
            for p in P:
                yield p
                # Make sure that p and P are what they should be
                P = (lambda P, p: (x for x in P if x % p))(P, p)
                break
            else:
                return
    >>> list(gen_sieve(20))
    [2, 3, 5, 7, 11, 13, 17, 19]
That's better. I like the idea of a sieve of Eratosthenes without any tables, and it's achievable in Python quite easily. The only problem is the one that I mentioned above, which boils down to:

In a generator expression, free names can be bound to a new object between the time when they are defined and when they are used, thus changing the value of the expression.

But I think that the behaviour of generator expressions would be more controllable and closer to that of 'real sequences' if the free names they contain were implicitly frozen when the generator expression is created. So I am proposing, for example, that:

    (f(x) for x in L)

be equivalent to:

    (lambda f, L: (f(x) for x in L))(f, L)

In most cases this would make generator expressions behave more like list comprehensions. You would be able to read the generator expression and think "that's what it does" more reliably. Of course, if a free name is bound to a mutable object, then there is always the chance that this object will be mutated between the creation of the generator expression and its use.

Lastly, instead of using a generator expression I could have written:

    from itertools import ifilter
    from functools import partial

    def tools_sieve(n):
        "Generate all primes less than n"
        P = xrange(2, n)
        while True:
            for p in P:
                yield p
                P = ifilter(partial(int.__rmod__, p), P)
                break
            else:
                return
    >>> list(tools_sieve(20))
    [2, 3, 5, 7, 11, 13, 17, 19]
It obviously works, as P and p are 'frozen' when ifilter and partial are called. If

    (f(x) for x in L if g(x))

is to be the moral equivalent of

    imap(f, ifilter(g, L))

then my proposal makes even more sense.

-- Arnaud
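(Illustrative aside, not part of the original mail: under current 2.5 semantics the itertools form already 'freezes' its ingredients when it is built, while the genexp only freezes its outermost iterable. The names f, g and L below are just placeholders.)

    from itertools import imap, ifilter

    L = range(5)
    f = lambda x: x * 10
    g = lambda x: x % 2

    gen = (f(x) for x in L if g(x))    # f and g are looked up lazily
    itr = imap(f, ifilter(g, L))       # f, g and L are captured here

    f = lambda x: -x                   # rebind f and g afterwards
    g = lambda x: True

    print list(gen)    # [0, -1, -2, -3, -4]  -- the new f and g are used
    print list(itr)    # [10, 30]             -- the original f and g are used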

On Dec 12, 2007 12:10 PM, Arnaud Delobelle <arno@marooned.org.uk> wrote:
Calling it a free variable is the right term (at least according to Python and the functional programmers of the world).

As for what you are asking for, I do believe it came up during the discussion of when genexps were added to the language. I honestly don't remember the reasoning as to why we didn't do it this way, but I am willing to guess it has something to do with simplicity and purity of what genexps are meant to be. Consider what your genexp, ``(x for x in P if x % p)``, really is::

    def _genexp():
        for x in P:
            if x % p:
                yield x

But what you are after is::

    def _genexp():
        _P = P
        _p = p
        for x in _P:
            if x % _p:
                yield x

The former maps to what you see in the genexp much more literally than the latter. And if you want the latter, just define a function like the above, but have it take P and p as arguments, and then you get your generator just the way you want it.

Genexps (as with listcomps) are just really simple syntactic sugar. And Python tends to shy away from hiding details like capturing variables implicitly in some piece of syntactic sugar. In my opinion it is better to be more explicit with what you want the generator to do and just write out a generator function.

-Brett
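(Illustrative aside: applied to the sieve, Brett's suggestion of an explicit generator function taking P and p as arguments might look like the sketch below; the names sieve_step and explicit_sieve are mine, not from the thread.)

    def sieve_step(P, p):
        # P and p are captured as arguments when sieve_step is called
        for x in P:
            if x % p:
                yield x

    def explicit_sieve(n):
        "Generate all primes less than n"
        P = xrange(2, n)
        while True:
            for p in P:
                yield p
                P = sieve_step(P, p)
                break
            else:
                return

    # list(explicit_sieve(20)) -> [2, 3, 5, 7, 11, 13, 17, 19]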

On 12 Dec 2007, at 21:56, Georg Brandl wrote:
I see. 'P' gets frozen but not 'p', so I should be able to write:

    def gen2_sieve(n):
        "Generate all primes less than n"
        P = xrange(2, n)
        while True:
            for p in P:
                yield p
                P = (lambda p: (x for x in P if x % p))(p)
                break
            else:
                return
    >>> list(gen2_sieve(20))
    [2, 3, 5, 7, 11, 13, 17, 19]
It seems to work. OK, then in the same vein, I imagine that

    (x + y for x in A for y in B)

becomes:

    def _genexp(A):
        for x in A:
            for y in B:
                yield x + y

Let's test this (Python 2.5):
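(The interactive session seems to be missing here; something along these lines, with the values '12' and 'cd' taken from the description below and 'ab' as an assumed initial value for B, reproduces the behaviour being discussed.)

    >>> A = '12'
    >>> B = 'ab'
    >>> gen = (x + y for x in A for y in B)
    >>> A = '34'
    >>> B = 'cd'
    >>> list(gen)
    ['1c', '1d', '2c', '2d']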
So in the generator expression, A remains bound to the string '12' but B gets rebound to 'cd'. This may make the implementation of generator expressions more straightforward, but from the point of view of a user of the language it seems rather arbitrary. What makes A so special, as opposed to B? OK, it belongs to the outermost loop, but conceptually in the example above there is no outermost loop.

At the moment I still think it makes more sense for a generator expression to generate, as much as possible, a sequence which is the same as what the corresponding list comprehension would have been, i.e. for:

    l = [f(x, y) for x in A for y in B(x) if g(x, y)]
    g = [f(x, y) for x in A for y in B(x) if g(x, y)]
    <code, maybe binding A, B, f, g to new objects>
    assert list(g) == l

to work as much as possible. Perhaps I should go and see how generator expressions are generated in the Python source code.

-- Arnaud
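(Illustrative aside: a concrete case, under current semantics, where the genexp and the list comprehension diverge; all names here are placeholders.)

    def f(x, y): return x * y
    A = [1, 2]
    def B(x): return [10, 20]

    l   = [f(x, y) for x in A for y in B(x)]
    gen = (f(x, y) for x in A for y in B(x))

    def B(x): return [0]      # rebind B before the genexp is consumed

    print l                   # [10, 20, 20, 40]
    print list(gen)           # [0, 0] -- not equal to l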

On 12 Dec 2007, at 23:41, Georg Brandl wrote:
You're right, I expressed myself badly: I was not talking about evaluation but about binding. I was saying that if the name A is bound, when the generator expression is created, to the object that A refers to at that point, then the same should happen with B.

-- Arnaud

On Thu, 13 Dec 2007 08:08:53 +0100, Arnaud Delobelle <arno@marooned.org.uk> wrote:
I think what Georg meant was this (I intended to reply to your earlier mail of Thursday AM, but Georg beat me to it): the reason for not binding B when the genexp is defined is so that you can do this:
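(The example itself appears to be missing here; something of this general shape, an illustrative reconstruction rather than Jan's original code, fits the description that follows.)

    a = [1, 2]
    b = None                              # b happens to exist already here,
    gen = (x + y for x in a for y in b)   # but its real value comes later
    b = [x * 10 for x in a]               # e.g. built from a afterwards
    print list(gen)                       # [11, 21, 12, 22]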
Here, b can't be bound to something at generator definition time because the 'something' may not exist yet. (It does, actually, in this example, but you get the point.) So only the first (outer loop) iterable is bound immediately.

Whether a variable is rebound within the expression could of course be decided at compile time, so all free variables could be bound immediately. I think that would be an improvement, but it requires the compiler to be a bit smarter. Unfortunately, it seems to be pythonic to bind variables at moments I disagree with :), like function default arguments (bound at definition instead of at call) and loop counters (rebound every iteration instead of every iteration having its own scope).

And, while I'm writing this:

On Thu, 13 Dec 2007 00:01:42 +0100, Arnaud Delobelle <arno@marooned.org.uk> wrote:
I suppose this should have been

    g = (f(x, y) for x in A for y in B(x) if g(x, y))

Jan
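(Illustrative aside on the default-argument behaviour Jan mentions above: defaults are evaluated once, when the def statement runs, which is the classic mutable-default surprise.)

    def append_to(item, lst=[]):
        lst.append(item)
        return lst

    print append_to(1)    # [1]
    print append_to(2)    # [1, 2] -- the same list object, bound at def time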

On 15 Dec 2007, at 22:41, Jan Kanis wrote:
In your example, b is not free of course.
This is what I was advocating. As it is decided at compile time which variables are free, it may only be a small extra step to add a bit of code saying that they must be bound at the creation of the generator expression. Or, to continue with the _genexp function mentioned in previous posts, for:

    (f(x) for b in A for x in b)

to be translated as:

    def _genexp(f, A):
        for b in A:
            for x in b:
                yield f(x)

as A and f are free but not b and x. Then

    gen = (f(x) for b in A for x in b)

would be translated as

    gen = _genexp(f, A)

I imagine this wouldn't be too hard, but I am not familiar with the specifics of Python code compilation... Moreover, this behaviour ('freezing' all free variables at the creation of the generator expression) is well defined and easy to reason about, I think. I haven't yet had the time to see how generator expressions are created, but I'd like to have a look, although I suspect I will have to learn a lot more besides in order to understand it.

[...]
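(Illustrative aside: CPython's compiler does already classify these names. Inspecting the code object behind a genexp shows which names are local to it and which are looked up from outside; the attributes below are standard code-object fields.)

    f = lambda x: x * 2
    A = [[1, 2], [3]]
    gen = (f(x) for b in A for x in b)

    code = gen.gi_frame.f_code
    print code.co_varnames    # locals of the genexp, e.g. ('.0', 'b', 'x')
    print code.co_names       # global names looked up lazily, e.g. ('f',)
    print code.co_freevars    # closure names, if the genexp is nested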
Yes! Sorry about that. In fact, I should also have called the generator expression something other than 'g', as it is already the name of a function (g(x, y)) :|

-- Arnaud

Arnaud Delobelle wrote:
My opinion is that this is the wrong place to attack the problem. Note that it's not only generator expressions that have this issue -- the same thing can trip you up with lambdas as well.

It's instructive to consider why other languages with lexical scoping and first-class functions don't seem to have this problem to the same extent. In Scheme, for example, the reason is that its looping constructs usually create a new binding for the loop variable on each iteration, instead of re-using the same one. So if you return a lambda from inside the loop, each one lives in a different lexical environment and sees a different value for the loop variable.

If Python's for-loop did the same thing, I suspect that this problem would turn up much less frequently. In CPython, there's a straightforward way to implement this: if the variable is used in an inner function, and is therefore in a cell, create a new cell each time round the loop instead of replacing the contents of the existing one. (If the variable isn't in a cell, there's no need to change anything.)

Note that this wouldn't interfere with using the loop variable after the loop has finished -- the variable is still visible to the whole function, and the following code just sees whatever is in the last created cell.

-- Greg
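(Illustrative aside: the symptom Greg describes, shown with lambdas inside a function so that the loop variable lives in a shared closure cell; the explicit-capture workaround via a default argument is shown for contrast.)

    def make_funcs():
        fs = []
        for i in range(3):
            fs.append(lambda: i)          # all three closures share one cell for i
        return fs

    print [f() for f in make_funcs()]     # [2, 2, 2], not [0, 1, 2]

    # Under Greg's proposal each iteration would get a fresh cell, giving
    # [0, 1, 2]. Today the usual workaround is to capture explicitly:
    def make_funcs2():
        return [(lambda i=i: i) for i in range(3)]

    print [f() for f in make_funcs2()]    # [0, 1, 2]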

participants (5): Arnaud Delobelle, Brett Cannon, Georg Brandl, Greg Ewing, Jan Kanis