[IronPython] Object ids and hash code

Tue Mar 29 04:34:43 CEST 2005

Sriram Krishnan wrote:
> 
> I saw that IronPython 0.7 uses
> System.Runtime.CompilerServices.RuntimeHelpers to get the hash code of an
> object (thereby removing the need for the Reflection.Emit/util dll hack in
> 0.6). However, if I remember correctly, .NET hash codes are *not* guaranteed
> to be unique. Brad Abrams talks about it here
> http://blogs.msdn.com/brada/archive/2003/09/30/50396.aspx.
> 
> Doesn't this open up problems when 2 objects have the same id?

Thanks for an excellent and tricky technical question.  I'll answer this
one tonight and take a crack at some of the other questions on the
morrow.

The previous hack and the current use of RuntimeHelpers to implement
Python's id builtin both do the same thing: they return the result of the
default, non-overridden Object.GetHashCode method called on the given
object.  So the 0.7 code is strictly better than 0.6: it is simpler and
no longer has weird build process issues, while still doing as good a
job of matching Python's id function.

The open question remains as to whether or not this is good enough.
Jython has used Java's System.identityHashCode function to implement
Python's id builtin from the very beginning.  This function behaves very
similarly to the RuntimeHelpers function that I'm using in IronPython.
They both almost always generate a unique int for a given object but
they are not 100% guaranteed to do so.  I can't recall ever seeing a bug
report from a Jython user indicating that this was a real issue for
them.  However, I can certainly imagine an odd corner case where this
could produce an extremely hard to understand and fix bug.  This is
likely to be one of the detailed Python compatibility issues where we
will spend some design time before getting to 1.0.

This issue is coupled to the fact that IronPython, like Jython, uses a
true garbage collector rather than a reference counting system.  CPython
can simply return an object's memory address from id, because that
address never changes while the object is alive.  The GCs used by .NET
and Java, by contrast, move objects around in order to compact heap
space, so an object's memory address can change over its lifetime.  This
makes it hard to precisely match Python's semantics for id.

Note: This doesn't make it impossible to match the semantics of id.  A
simple implementation would tie a Dictionary to the id function and use
it to keep track of every object passed in.  When id sees an object that
isn't in the Dictionary yet, it increments an int counter, records the
new value for that object, and returns it; later calls for the same
object return the recorded value.  Obviously there'd be some performance
and memory issues with this approach.
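
The table-based approach described above can be sketched in a few lines
of Python.  This is only an illustration under stated assumptions, not
IronPython's code: the IdTable class and its get_id method are
hypothetical names, and a linear scan using "is" stands in for a
Dictionary keyed on reference identity.

```python
class IdTable:
    """Hands out a unique, stable integer per object (hypothetical helper)."""

    def __init__(self):
        # Strong references to (object, assigned id) pairs.  Holding them
        # keeps every object alive, which is the memory cost noted above.
        self._seen = []
        self._next_id = 1

    def get_id(self, obj):
        # Compare by identity ("is"), never by equality or hash code.
        for known, assigned in self._seen:
            if known is obj:
                return assigned
        assigned = self._next_id
        self._next_id += 1
        self._seen.append((obj, assigned))
        return assigned


table = IdTable()
a = [1, 2]
b = [1, 2]                  # equal to a, but a distinct object
id_a = table.get_id(a)
id_b = table.get_id(b)
```

Here id_a and id_b differ even though a == b, and asking for either
object again returns the same number.  A real implementation would
presumably need a hashed lookup and weak references so that the table
neither slows down as it grows nor keeps every object alive.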

The challenge with IronPython is very similar to that of Jython: to
strike the right balance between perfect compatibility with CPython (the
name for the standard implementation of Python) and working as well as
possible with the target platform.  I suspect that the right answer here
for IronPython will be the same as the Jython decision; however, I'd be
interested in learning about existing Python programs that have a
critical dependency on the semantics of id.

I can tell you about one difference between IronPython and CPython that
I'm sure won't be going away.  IronPython will never guarantee the
deterministic finalization that you can get in CPython.  The classic
example is this code:

    text = open("foo.txt", "r").read()

In standard CPython you are guaranteed that the file object will be
closed (assuming that the builtin open function was not overridden)
because the moment the last reference to it goes away the destructor
will be called and the file will be closed.  IronPython (and Jython)
do not offer this guarantee.  FYI - There are tricks that can be
played for this particular very special case of file objects that
IronPython may decide to use, but the general case of finalization at
the exact moment the last reference goes away won't be solved.
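
Code that should behave the same on any implementation can close the
file explicitly instead of relying on the last reference going away.  A
minimal sketch follows; read_text is a hypothetical helper name, and the
temporary file exists only to make the example self-contained.

```python
import os
import tempfile


def read_text(path):
    """Read a whole file and close it explicitly (hypothetical helper)."""
    f = open(path, "r")
    try:
        return f.read()
    finally:
        # Runs whether or not read() raised, on any Python implementation.
        f.close()


# Small demonstration using a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
handle = os.fdopen(fd, "w")
try:
    handle.write("hello")
finally:
    handle.close()

text = read_text(demo_path)
os.remove(demo_path)
```

The try/finally form does not depend on when the garbage collector
runs, so the file is closed by the time read_text returns.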

I was asked this question during my PyCon talk and Guido van Rossum was
kind enough to answer it for me.  The community had already been through
this discussion in earnest back in the early Jython days and in the end
concluded that it was acceptable for a Python implementation to not use
a reference count based GC.  The winning argument was that this was an
incredibly risky feature to depend on and that Python programmers on any
platform could improve their code by not depending on the precise
semantics of reference counting.  Because we've already had this
discussion once before in great depth it was easy to make this
particular decision for IronPython.

Thanks - Jim