Comparison of cyclic objects (was RE: [Python-Dev] trashcan and PR#7)

Jeremy Hylton jeremy@cnri.reston.va.us
Thu, 13 Apr 2000 15:08:12 -0400 (EDT)


>>>>> "TP" == Tim Peters <tim_one@email.msn.com> writes:

  TP> [Jeremy Hylton]>
  >> So the real problem is defining some reasonable semantics for
  >> comparison of recursive objects.

  TP> I think this is exactly a graph isomorphism problem, since
  TP> Python always compares "by value" (so isomorphism is the natural
  TP> generalization).

I'm not familiar with any algorithms for the graph isomorphism
problem, but I took a stab at a simple comparison algorithm.  The idea
is to detect comparisons that would cross back-edges in the object
graphs.  Instead of starting a new comparison, assume they are the
same.  If, in fact, the objects are not the same, they must differ in
some other way; some other part of the comparison will fail.

  TP> This isn't hard (!= tedious, alas) to define or to implement
  TP> naively, but a straightforward implementation would be very
  TP> expensive at runtime compared to the status quo.  That's why
  TP> "real languages" <wink> would rather suffer an infinite loop.
  TP> It's expensive because there's no cheap way to know whether you
  TP> have a loop in an object.

My first attempt at implementing this is expensive.  I maintain a
dictionary that contains all the object pairs that are currently being 
compared.  Specifically, the dictionary is used to implement a set of
object id pairs.  Every call to PyObject_Compare will add a new
pair to the dictionary when it is called and remove it when it returns 
(except for a few trivial cases).

A naive patch is included below.  It does seem to involve a big
performance hit -- more than 10% slower on pystone.  It also uses a
lot of extra space.

Note that the patch has all its initialization code inline in
PyObject_Compare; moving that elsewhere will help a little.  It also
use a bunch of function calls where macros would be more efficient.

  TP> An anal compromise would be to run comparisons full speed
  TP> without trying to detect loops, but if the recursion got "too
  TP> deep" break out and start over with an expensive alternative
  TP> that does check for loops.  The later requires machinery similar
  TP> to copy.deepcopy's.

It looks like the anal compromise might be necessary.  I'll
re-implement the patch more carefully and see what the real
effect on performance is.

Jeremy

Index: object.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Objects/object.c,v
retrieving revision 2.67
diff -r2.67 object.c
239c239
<                                    "__repr__ returned non-string (type %.200s)",
---
>                                    "__repr__ returned non-string (type %s)",
276c276
<                            "__str__ returned non-string (type %.200s)",
---
>                            "__str__ returned non-string (type %s)",
300a301,328
> static PyObject *cmp_state_key = NULL;
> 
> static PyObject*
> cmp_state_make_pair(v, w)
>       PyObject *v, *w;
> {
>       PyObject *pair = PyTuple_New(2);
>       if (pair == NULL)
>               return NULL;
>       if ((long)v <= (long)w) {
>               PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)v));
>               PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)w));
>       } else {
>               PyTuple_SET_ITEM(pair, 0, PyInt_FromLong((long)w));
>               PyTuple_SET_ITEM(pair, 1, PyInt_FromLong((long)v));
>       }
>       return pair;
> }
> 
> void
> cmp_state_clear_pair(dict, key)
>       PyObject *dict, *key;
> {
>       PyDict_DelItem(dict, key);
>       Py_DECREF(key);
> }
> 
> 
305a334,336
>       PyObject *tstate_dict, *cmp_dict, *pair;
>       int result;
> 
311a343,376
>       tstate_dict = PyThreadState_GetDict();
>       if (tstate_dict == NULL) {
>               PyErr_BadInternalCall();
>               return -1;
>       } 
> /*    fprintf(stderr, "PyObject_Compare(%X: %s, %X: %s)\n", (long)v, 
>               v->ob_type->tp_name, (long)w, w->ob_type->tp_name);
> */
>       /* XXX should initialize elsewhere */
>       if (cmp_state_key == NULL) {
>               cmp_state_key = PyString_InternFromString("compare_state"); 
>               cmp_dict = PyDict_New();
>               if (cmp_dict == NULL)
>                       return -1;
>               PyDict_SetItem(tstate_dict, cmp_state_key, cmp_dict);
>       } else {
>               cmp_dict = PyDict_GetItem(tstate_dict, cmp_state_key);
>               if (cmp_dict == NULL)
>                       return NULL;
>               PyDict_SetItem(tstate_dict, cmp_state_key, cmp_dict);
>       }
>              
>       pair = cmp_state_make_pair(v, w);
>       if (pair == NULL) {
>               PyErr_BadInternalCall();
>               return -1;
>       }
>       if (PyDict_GetItem(cmp_dict, pair)) {
>               /* already comparing these objects.  assume they're
>                  equal until shown otherwise 
>               */
>               Py_DECREF(pair);
>               return 0;
>       }
316a382,384
>               if (PyDict_SetItem(cmp_dict, pair, pair) == -1) {
>                       return -1;
>               }
317a386
>               cmp_state_clear_pair(cmp_dict, pair);
329a399,401
>       if (PyDict_SetItem(cmp_dict, pair, pair) == -1) {
>               return -1;
>       }
344a417
>                               cmp_state_clear_pair(cmp_dict, pair);
350,364c423,425
<               else if (PyUnicode_Check(v) || PyUnicode_Check(w)) {
<                       int result = PyUnicode_Compare(v, w);
<                       if (result == -1 && PyErr_Occurred() && 
<                           PyErr_ExceptionMatches(PyExc_TypeError))
<                               /* TypeErrors are ignored: if Unicode coercion
<                               fails due to one of the arguments not
<                               having the right type, we continue as
<                               defined by the coercion protocol (see
<                               above). Luckily, decoding errors are
<                               reported as ValueErrors and are not masked
<                               by this technique. */
<                               PyErr_Clear();
<                       else
<                               return result;
<               }
---
>               cmp_state_clear_pair(cmp_dict, pair);
>               if (PyUnicode_Check(v) || PyUnicode_Check(w))
>                       return PyUnicode_Compare(v, w);
372c433,434
<       if (vtp->tp_compare == NULL)
---
>       if (vtp->tp_compare == NULL) {
>               cmp_state_clear_pair(cmp_dict, pair);
374c436,439
<       return (*vtp->tp_compare)(v, w);
---
>       }
>       result = (*vtp->tp_compare)(v, w);
>       cmp_state_clear_pair(cmp_dict, pair);
>       return result;