New __reference__ hook

Hi, guys.

Python 3 is very close to becoming a holy grail of programming languages, in the sense that almost everything can be redefined. However, there is still one thing missing: immutable copy-on-assign numeric types. Consider this piece of code:

    a = 1
    b = a
    a += 1
    assert a == b + 1

The object "1" gets assigned to the "a" variable, then another independent copy gets assigned to the "b" variable, then the value in the "a" variable gets modified without affecting the second. The problem is that this behaviour cannot be recreated in user-defined classes:

    a = MyInteger(1)
    b = a
    a += 1
    assert a == b + 1

The "a" and "b" variables both point to the same object. This differs from what one might expect of numeric types.

My proposal is to define another hook that gets called when an object is referenced:

    import copy

    class MyInteger:
        def __reference__(self, context):
            return copy.copy(self)

Each time the reference count of an object would normally be incremented, this method would be called and the returned object referenced instead. The default implementation would of course return self. The context argument would give the object some information about how it is being referenced.

This would allow easy implementation of concepts such as singletons, copy-on-write and immutables, and could even simplify things like reference loops. The most obvious use case would be implementations of mathematical types like vectors, polynomials and so on. I encountered this problem when I was writing a simple vector library and had to explicitly copy each object on assignment, which is particularly annoying. The programmer's intuition for numeric-like types is that they are immutable, yet they should support augmented assignment operators. This cannot be implemented transparently in current Python, hence my proposal.

Cheers, haael.
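[The original post does not show how MyInteger is implemented. A minimal sketch of the failure mode being described, assuming a hypothetical mutable wrapper with an in-place __iadd__; the implementation below is illustrative, not from the post:]

    class MyInteger:
        """Hypothetical mutable integer wrapper, to illustrate the aliasing problem."""
        def __init__(self, i):
            self._i = i

        def __iadd__(self, other):
            # In-place mutation: every name bound to this object sees the change.
            self._i += other._i if isinstance(other, MyInteger) else other
            return self

        def __add__(self, other):
            value = other._i if isinstance(other, MyInteger) else other
            return MyInteger(self._i + value)

        def __eq__(self, other):
            value = other._i if isinstance(other, MyInteger) else other
            return self._i == value

    a = MyInteger(1)
    b = a                     # no copy: both names refer to the same object
    a += 1                    # __iadd__ mutates that one shared object
    assert a == b             # both "changed" together...
    assert not (a == b + 1)   # ...so the int-like test fails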

On 04/12/12 20:58, haael@interia.pl wrote:
> Python 3 is very close to becoming a holy grail of programming languages, in the sense that almost everything can be redefined.

I dispute that "everything can be redefined" is the holy grail of programming languages. If it were, why isn't everyone using Forth?

> The object "1" gets assigned to the "a" variable,

Correct, for some definition of "assigned" and "variable".

> then another independent copy gets assigned to the "b" variable,

Completely, utterly wrong.

> then the value in the "a" variable gets modified

Incorrect.
> The problem is that this behaviour cannot be recreated in user-defined classes

Of course it can.

    py> from decimal import Decimal  # A pure-Python class, prior to Python 3.3
    py> a = Decimal(1)
    py> b = a
    py> a += 1
    py> assert a == b + 1
    py> print a, b
    2 1

If you prefer another example, use fractions.Fraction, also pure Python and immutable, with support for augmented assignment.

I don't mean to be rude, or dismissive, but this is pretty basic Python stuff. Please start with the Python data and execution models:

http://docs.python.org/2/reference/datamodel.html
http://docs.python.org/2/reference/executionmodel.html

although I must admit I don't find either of them especially clear. But in simple terms, you need to reset your thinking: your assumptions about what Python does are incorrect. When you say:

    a = 1

you are *binding* the name "a" to the object 1. When you then follow by saying:

    b = a

you bind the name "b" to the *same* object 1. It is not a copy. It is not a "copy on assignment" or any other clever trick. You can prove to yourself that they are the same object:

    py> a = 1
    py> b = a
    py> a is b
    True
    py> id(a), id(b)
    (140087472, 140087472)

When you then call

    a += 1

this does not modify anything. Int objects (and floats, Decimal, strings, and many others) are *immutable* -- they cannot be modified. So `a += 1` creates a new object, 2, and binds it to the name "a". But the binding from object 1 to name "b" is not touched.

If this is still not clear, I recommend you take the discussion onto one of the other Python mailing lists, especially tutor@python.org or python-list@python.org, which are more appropriate for discussing these things.

-- Steven

On Dec 04, 2012, at 10:14 PM, Steven D'Aprano wrote:
> I dispute that "everything can be redefined" is the holy grail of programming languages. If it were, why isn't everyone using Forth?

On the readability scale, where Python is pretty close to a 10 (almost everyone can read almost all Python), Perl is a 4 (hard to read your own code after a week or so), Forth is a 1 (you can't even read your own code after your fingers stop moving). :)

a-forth-enthusiast-from-way-back-in-the-day-ly y'rs,
-Barry

On Tue, Dec 4, 2012 at 4:58 AM, <haael@interia.pl> wrote:
You misunderstand Python's semantics. Python never implicitly copies anything. Some types, like int, are immutable, so you can't distinguish meaningfully between copying and not. All names like `a` can be rebound (`a = ...`), and this never mutates the object. Some objects can be mutated, which is done by some means other than rebinding a name.

I don't know what problem you had defining MyInteger. Here is a definition (albeit comprised of very, very sloppy code) that passes your test:

    class MyInteger(object):
        def __init__(self, i):
            self._i = i

        def __add__(self, other):
            if isinstance(other, MyInteger):
                other = other._i
            return MyInteger(self._i + other)

        def __eq__(self, other):
            return self._i == other._i

Mike
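[A quick check, not in the original message, that this class does pass the test: since there is no __iadd__, `a += 1` falls back to __add__ and rebinds "a" to a new object:]

    a = MyInteger(1)
    b = a
    a += 1               # equivalent to a = a + 1: a brand-new MyInteger
    assert a == b + 1    # passes: a holds 2, b still holds 1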

On 05.12.2012 18:05, Jasper St. Pierre wrote:
> And? What's wrong with an __iadd__ that's exactly the same as Mike's __add__?
I think it was a Java confusion. He thought numbers were copied on assignment. But there is no difference between value types and object types in Python. Ints and floats are immutable, but they are not value types as in Java.

But apart from that, I think allowing overloading of the binding operator "=" might be a good idea. A special method __bind__ could return the object to be bound:

    a = b

should then bind the name "a" to the return value of b.__bind__(), if b implements __bind__.

Sure, it could be used to implement copy on assignment. But it would also do other things, like allowing lazy evaluation of an expression. NumPy code like

    z = a*x + b*y + c

could avoid creating three temporary arrays if there were a __bind__ function called on "=". This is a big thing, cf. the difference between NumPy and numexpr:

    z = numexpr.evaluate("""a*x + b*y + c""")

The reason numerical expressions must be written as strings to be efficient in Python is that there is no __bind__ function.

Sturla
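[No such hook exists, so the proposed semantics can only be emulated with an explicit helper. A minimal sketch; the bind() function and the __bind__ method name are hypothetical, mirroring the proposal:]

    def bind(value):
        """Emulate what 'name = value' would do under the proposed __bind__ hook."""
        hook = getattr(type(value), '__bind__', None)
        return hook(value) if hook is not None else value

    class CopyOnAssign:
        def __init__(self, i):
            self._i = i

        def __bind__(self):
            # Hand out an independent copy whenever the object is bound to a name.
            return CopyOnAssign(self._i)

    a = CopyOnAssign(1)
    b = bind(a)           # under the proposal, plain 'b = a' would do this
    b._i += 1
    assert a._i == 1 and b._i == 2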

On 2012-12-05, at 19:09 , Sturla Molden wrote:
Sounds odd and full of strange edge-cases. Would bind also get called when providing parameters to a function call? When putting an object in a literal of some sort? When returning an object from a function/method? If not, why not?
Why? z could just be a "lazy value" at this point, basically a manual building of thunks, only reifying them when necessary (whenever that is). It's not like numpy *has* to create three temporary arrays, just that it *does*.

Den 5. des. 2012 kl. 19:51 skrev Masklinn <masklinn@masklinn.net>:
It has to, because it does not know when to flush an expression. This, strangely enough, accounts for most of the speed difference between Python/NumPy and e.g. Fortran 95. A Fortran 95 compiler can compile an array expression as a single loop. NumPy cannot, because the binary operators do not tell it when an expression is "finalized". That is why the numexpr JIT compiler evaluates Python expressions as strings, and needs to include a parser and whatnot.

Today, most numerical code is memory bound, not compute bound, as CPUs are immensely faster than RAM. So what keeps numerical/scientific code written in Python slower than C or Fortran today is mostly the creation of temporary array objects (i.e. memory access), not the computations per se. If we could get rid of temporary arrays, Python code could possibly achieve 80% of Fortran 95 speed. For scientists that would mean we don't need to write any more Fortran or C.

But perhaps it is possible to do this with AST magic? I don't know. Nor do I know if __bind__ is the best way to do this. Perhaps not. But I do know that automatically detecting when to "flush a compound expression with (NumPy?) arrays" would be the holy grail for scientific computing with Python. A binary operator x+y would just return a symbolic representation of the expression, and when the full expression needs to be flushed we could e.g. ask OpenCL or LLVM to generate the code on the fly. It would turn numerical computing into something similar to dynamic HTML. And we know how good Python is at generating structured text on the fly.

Sturla
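[For reference, the numexpr approach mentioned above looks like this in practice, assuming numexpr is installed; evaluate() picks up the variables from the calling scope by default:]

    import numpy as np
    import numexpr

    a, b, c = 2.0, 3.0, 4.0
    x = np.random.rand(1000000)
    y = np.random.rand(1000000)

    # NumPy: allocates temporaries for a*x, b*y and their sum before the final add.
    z1 = a*x + b*y + c

    # numexpr: parses the string, then evaluates blockwise in a single pass.
    z2 = numexpr.evaluate("a*x + b*y + c")

    assert np.allclose(z1, z2)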

On 2012-12-05, at 21:09 , Sturla Molden wrote:
That tends to be the hard thing to decide, but it should be possible to find out most cases e.g. evaluate the thunks when elements are requested (similar to generators, but do the whole thunk at once), when printing, etc… Or use the numexpr approach and perform the reification explicitly.
> But perhaps it is possible to do this with AST magic? I don't know.
I'm not sure there's even a need for AST magic (although you could also play with that by writing operations within lambdas, I guess; I've never done much AST analysis/rewriting). It could simply use an approach similar to SQLAlchemy's ClauseElement: when applying an operation to e.g. an array, rather than performing it, just return a representation of the operation itself (effectively rebuilding some sort of AST). New operations on *that* would simply build the tree further (composing the thunk), and an explicit evaluation call, or implicit evaluation due to e.g. accessing stuff, would compile the "potential" operation and perform the actual computations.
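[A bare-bones sketch of that idea; illustrative only, as a real implementation would cover many more operations and evaluation triggers:]

    import numpy as np

    class Lazy:
        """Wrap an array, or an operation tree over arrays, and defer evaluation."""
        def __init__(self, payload, op=None, other=None):
            self.payload, self.op, self.other = payload, op, other

        def __add__(self, other):
            return Lazy(self, '+', other)   # build the tree; compute nothing

        def __mul__(self, other):
            return Lazy(self, '*', other)

        def evaluate(self):
            # Explicit reification: walk the tree once and compute the result.
            if self.op is None:
                return self.payload
            left = self.payload.evaluate() if isinstance(self.payload, Lazy) else self.payload
            right = self.other.evaluate() if isinstance(self.other, Lazy) else self.other
            return left + right if self.op == '+' else left * right

    x = Lazy(np.arange(3.0))
    y = Lazy(np.ones(3))
    z = x * 2.0 + y          # builds a tree; nothing is computed yet
    print(z.evaluate())      # [1. 3. 5.]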
Indeed.

On 5 December 2012 18:09, Sturla Molden <sturla@molden.no> wrote:
Today that can be achieved by crafting a class that overrides all operators to perform literal transforms, plus a "flush" or "calculate" method. Sympy does something like that, and it would not be hard to have a module that performs like that with numpy arrays. In this particular use case, we'd have the full benefit of "explicit is better than implicit".

js -><-

On 5 December 2012 18:09, Sturla Molden <sturla@molden.no> wrote:
FYI, the numpy module shipped with PyPy does exactly this: the operations are recorded in some AST structure, which is evaluated only when the first item of the array is read. This is completely transparent to the user, or to other parts of the interpreter. PyPy uses JIT techniques to generate machine code specialized for the particular AST, and is typically 2x to 5x faster than Numpy, probably because a lot of allocations/copies are avoided. -- Amaury Forgeot d'Arc

On Wed, Dec 5, 2012 at 3:09 PM, Sturla Molden <sturla@molden.no> wrote:
Not all pixel fiddling can be solved using array calculus, so there will always be C involved at some point. Still, this could be a great advancement. Though I don't think bind-time is the right time to evaluate anything, as it would drive optimizing programmers to "one-line" things. Using intermediate variable names to explain an algorithm is crucial for readability, in my experience. Creating intermediate objects that are only evaluated when the programmer explicitly demands it is the way to go, e.g. "evalf" in sympy: http://scipy-lectures.github.com/advanced/sympy.html
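[For illustration, the explicit-evaluation style in sympy referred to above; requires sympy, and expressions stay symbolic until evalf() is called:]

    from sympy import symbols, sqrt

    x, y = symbols('x y')
    expr = sqrt(2) * x + y               # stays symbolic: sqrt(2)*x + y
    concrete = expr.subs({x: 3, y: 1})   # still exact: 3*sqrt(2) + 1
    print(concrete.evalf())              # only now evaluated numerically: about 5.2426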

On 5 December 2012 20:09, Sturla Molden <sturla@molden.no> wrote:
> But perhaps it is possible to do this with AST magic?
As far as I understand it, numba [1] does this kind of AST magic (among other things).

[1] https://github.com/numba/numba

Thomas
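[A small example of the kind of code numba can compile; hedged, since numba's API has evolved over the years, but @jit is its long-standing entry point:]

    from numba import jit
    import numpy as np

    @jit(nopython=True)
    def axpy_plus(a, x, b, y, c):
        # Compiled to a single fused loop: no temporary arrays for a*x or b*y.
        out = np.empty_like(x)
        for i in range(x.shape[0]):
            out[i] = a * x[i] + b * y[i] + c
        return out

    x = np.arange(5.0)
    y = np.ones(5)
    print(axpy_plus(2.0, x, 3.0, y, 4.0))   # [ 7.  9. 11. 13. 15.]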

On Wed, Dec 5, 2012 at 10:09 AM, Sturla Molden <sturla@molden.no> wrote:
It seems a bit more complicated than that. Take the example below. When is __bind__ going to be called? After a is multiplied by x, b is multiplied by y, etc., or before? If after, that doesn't accomplish lazy evaluation as below. If before, then somehow this has to convert to a form that calls z.__bind__(something), and what is that something?
There is another way to write expressions that don't get evaluated:

    lambda: a*x + b*y + c

So you could write this as

    z.bind(lambda: rhs)

or, if this is important enough, there could be a new bind operator:

    lhs @= rhs

which is equivalent to

    lhs.__bind__(lambda: rhs)

I think overriding = so that sometimes it does regular binding and sometimes this magic binding would be confusing and dangerous. It means that every assignment operates differently if the lhs is already bound. Consider the difference between

    t = a + b
    typo = a + b

    t @= a + b
    typo @= a + b

where typo was supposed to be t but was mistyped. In the first set, line 1 does __bind__ to a+b while line 2 just adds a and b and does a normal binding. In the second set, the first does __bind__ while the second raises an exception that typo is not bound. It's even worse in the context of something like this:

    d = {}
    for i in range(2):
        d['x'] = a + i

In the first pass through the loop this is a regular assignment. In the second pass it may call __bind__ depending on what the value of a + 0 is. Ick.

--- Bruce
Follow me: http://www.twitter.com/Vroo http://www.vroospeak.com
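[A sketch of the explicit lambda-based alternative described above; the Box class and its bind method are hypothetical illustrations, not an existing API:]

    class Box:
        """Hold a deferred right-hand side and evaluate it on demand."""
        def __init__(self):
            self._thunk = None

        def bind(self, thunk):
            # Store the un-evaluated expression; nothing is computed here.
            self._thunk = thunk

        def value(self):
            # Reify the expression only when the result is actually needed.
            return self._thunk()

    a, x, b, y, c = 2, 3, 4, 5, 6
    z = Box()
    z.bind(lambda: a*x + b*y + c)   # deferred: no arithmetic happens yet
    a = 10                          # the thunk sees the rebound name
    print(z.value())                # 10*3 + 4*5 + 6 = 56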

On 12/5/2012 1:09 PM, Sturla Molden wrote:
An assignment statement mutates the current local namespace. The 'current local namespace' is a hidden input to the function performed by all statements, but it need not be a Python object. The key symbol '=' is not an operator and 'a = b' is not an expression.
If one wants to perform 'a = f(b)' or 'a = b.meth()' instead of 'a = b', then one should just explicitly say so.
No; z = (a*x + b*y + c).__bind__(), which is how you defined __bind__ as working, still requires that the expression be evaluated to an object. The definition of Python requires computation of a*x, b*y, (a*x + b*y), and finally (a*x + b*y) + c, in that order. Either '*' or '+' may have side effects.
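[A tiny demonstration of that evaluation order, using a wrapper whose operators log each step; illustrative only:]

    class Traced:
        def __init__(self, name):
            self.name = name

        def __mul__(self, other):
            print('computing %s * %s' % (self.name, other.name))
            return Traced('(%s*%s)' % (self.name, other.name))

        def __add__(self, other):
            print('computing %s + %s' % (self.name, other.name))
            return Traced('(%s+%s)' % (self.name, other.name))

    a, x, b, y, c = (Traced(n) for n in 'axbyc')
    z = a*x + b*y + c
    # computing a * x
    # computing b * y
    # computing (a*x) + (b*y)
    # computing ((a*x)+(b*y)) + c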
> The reason numerical expressions must be written as strings to be efficient in Python is that there is no __bind__ function.
No, it is because the semantics of Python require inefficiency that can only be removed by a special parser-compiler with additional knowledge of the relevant object class, method, and instance properties. Such knowledge allows code to be rewritten without changing the effect. For Fortran arrays, the needed information includes the number and length of each dimension. These are either declared or parameterized and passed as arguments.

https://code.google.com/p/numexpr/wiki/Overview says that numexpr.evaluate(ex) first calls compile(ex), but does not say whether it has compile() compile to CPython bytecode or only to an AST. In either case, it converts array-wise operations to block-wise operations inside loops run on a custom C-coded virtual machine. They imply that this is not as good as element-wise operations compiled to native machine code. In any case, it knows that numpy array operations are side-effect free, and it must use the runtime dimension and size info.

-- Terry Jan Reedy
