[Python-Dev] Comparison speed

Tim Peters tim.one@home.com
Mon, 14 May 2001 23:36:42 -0400


[Martin v. Loewis]
> ...
> When stepping through the code, I also missed support for the
> relationship between identity and equality. E.g. in
> PyObject_RichCompare, I'd expect
>
>   if (v == w) {
>      switch (op) {
>      case Py_EQ:case Py_LE:case Py_GE:
>         Py_INCREF(Py_True);
>         return Py_True;
>      case Py_NE:case Py_LT:case Py_GT:
>         Py_INCREF(Py_False);
>         return Py_False;
>      }
>   }
>
> That would not help in your case, of course. I don't even know how
> frequent comparing identical objects is in real life - but this is
> something that PyObject_Compare has that PyObject_RichCompare
> currently doesn't.

Guido insisted (with cause <wink>) on these four pairs as being equivalent:

    x <  y  iff  y >  x
    x <= y       y >= x
    x == y       y == x
    x != y       y != x

but beyond that, in the presence of rich comparisons, agreed not to make any
other assumptions about what those pixel-bags "mean".  In particular, there's
no implication that "x <= y" iff "x < y or x == y", or that "x < y" implies
"x != y", etc.

Applying that to the above leaves you with nothing but

   if (v == w && op == Py_EQ) /* then return Py_True */

Which is about all PyObject_Compare's

	if (v == w)
		return 0;

assumes too.  So I don't see much future in that.
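Spelled out, that's the entire identity shortcut the rules leave room for (a
hypothetical sketch, in PyObject_RichCompare's terms):

	/* Hypothetical sketch:  the most those equivalences leave room
	   for.  Identity can short-circuit Py_EQ to true, but it says
	   nothing about Py_NE or the ordering operators. */
	if (v == w && op == Py_EQ) {
		Py_INCREF(Py_True);
		return Py_True;
	}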

[later, a patch to fill in the richcmp slot for strings]
> +static PyObject*
> +string_richcompare(PyStringObject *a, PyStringObject *b, int op)
> +{
> +	int c;
> +	PyObject *result;
> +	if (!PyString_Check(b)) {
> +		result = Py_NotImplemented;
> +		goto out;
> +	}
> +	if (op == Py_EQ) {
> +		if (a->ob_size != b->ob_size) {
> +			result = Py_False;
> +			goto out;
> +		}
> +#ifdef CACHE_HASH
> +		if (a->ob_shash != b->ob_shash
> +		    && a->ob_shash != -1
> +		    && b->ob_shash != -1) {
> +			result = Py_False;
> +			goto out;
> +		}
> +#endif
> +	}
> +	c = string_compare(a, b);
> +	switch (op) {
> +	case Py_LT: c = c <  0; break;
> +	case Py_LE: c = c <= 0; break;
> +	case Py_EQ: c = c == 0; break;
> +	case Py_NE: c = c != 0; break;
> +	case Py_GT: c = c >  0; break;
> +	case Py_GE: c = c >= 0; break;
> +	default:
> +		result = Py_NotImplemented;
> +		goto out;
> +	}
> +	result = c ? Py_True : Py_False;
> +  out:
> +	Py_INCREF(result);
> +	return result;

[and that yields about an 8% speedup in the "<" case]

That looks on the right track, but maybe at the wrong level:  why is it
necessary?  That is, the bulk of the "smarts" here in the switch stmt are
type-independent:  if there's no specific implementation of individual
comparisons, but there is a tp_compare, then the switch stmt applies verbatim
to *any* such type.  Do we have to fill in the richcmp slot for everything to
get Python to realize that?  I mean "just about everything", too:  while,
e.g., ceval special-cases "<" for ints, that doesn't do sorting or max or min
etc on ints a lick of good (they don't go thru the COMPARE_OP opcode then,
but thru the general comparison routines).
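The type-independent part is small enough to write down; something along
these lines (names invented here for the sketch -- not claiming this is
what's in object.c):

	/* Sketch only:  turn a classic 3-way tp_compare result `c` into
	   the rich-comparison object for `op` (assumed to be one of the
	   six Py_LT..Py_GE values).  Any type with a tp_compare slot but
	   no richcmp slot could share this verbatim. */
	static PyObject *
	convert_3way_to_rich(int c, int op)
	{
		int ok = 0;
		PyObject *result;

		switch (op) {
		case Py_LT: ok = c <  0; break;
		case Py_LE: ok = c <= 0; break;
		case Py_EQ: ok = c == 0; break;
		case Py_NE: ok = c != 0; break;
		case Py_GT: ok = c >  0; break;
		case Py_GE: ok = c >= 0; break;
		}
		result = ok ? Py_True : Py_False;
		Py_INCREF(result);
		return result;
	}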

The "speed problem" appears to be:

+ COMPARE_OP calls cmp_outcome()
+   which calls PyObject_RichCompare()
+     which calls do_richcmp()
+       which calls try_rich_compare() (unsuccessfully now,
                                        successfully after your patch)
          which fails to find a richcmp slot on either operand (now)
          so says "not implemented"
+       then calls try_3way_to_rich_compare()
+         which calls try_3way_compare()
+            which finally calls the tp_compare slot
+            then runs exactly the same switch as your patch:
		switch (op) {
		case Py_LT: c = c <  0; break;
		case Py_LE: c = c <= 0; break;
		case Py_EQ: c = c == 0; break;
		case Py_NE: c = c != 0; break;
		case Py_GT: c = c >  0; break;
		case Py_GE: c = c >= 0; break;
		}
		result = c ? Py_True : Py_False;

and things unwind.  So we've got 7 function calls there, not even counting
calls to PyErr_Occurred() and PyObject_IsTrue(), all to find about 3 machine
instructions that actually do the compare <wink>.

You got an 8% speedup for one type by tricking the switch stmt into appearing
3 calls earlier.  What if the implementation were smarter, and did it for
*all* relevant types even a call or two before that?
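For instance, a hypothetical early check in PyObject_RichCompare -- the slot
names are the 2.1 type-struct ones, the error convention is guessed, the
helper is the one sketched above, and a real version would also have to
check the HAVE_RICHCOMPARE type flag before poking at the slot:

	/* Hypothetical fast path:  if both operands share a type that has
	   a classic tp_compare slot but no tp_richcompare slot, call
	   tp_compare directly and convert, instead of discovering the
	   same thing several calls later. */
	if (v->ob_type == w->ob_type
	    && v->ob_type->tp_richcompare == NULL
	    && v->ob_type->tp_compare != NULL) {
		int c = (*v->ob_type->tp_compare)(v, w);
		if (c < 0 && PyErr_Occurred())
			return NULL;	/* guessed error convention */
		return convert_3way_to_rich(c, op);	/* sketch above */
	}

That's the whole switch, one call deep, for every such type at once.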

I don't see any reason "in principle" that compares couldn't be much faster,
and via the usual gimmicks:  bigger, smarter functions that remember what
they've already determined so they don't need to figure it out over and over
again, and fast paths to favor common cases at the expense of comparisons
from Mars.  One thing to note here:  the workhorse comparisons are "like
strings" in having no *logical* need for richcmps at all; and the objects for
which richcmps were introduced were numerical arrays, which can much better
afford a longer code path to *find* them (one matrix compare will trigger
many vanilla element compares anyway, so even for arrays it's much more
important that the *latter* be fast).  The code now is approximately
backwards in that respect (it takes gobs of work before we even *look* for a
cmp now -- indeed, if a type has both cmp and richcmp slots now, and we're
doing an explicit "cmp" compare, the code now tries to *simulate* cmp first
via a long sequence of richcmp calls!).
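The fix for that "backwards" half looks equally small in principle;
schematically (hypothetical again, not a patch):

	/* Hypothetical fast path for the classic 3-way compare:  when the
	   operands share a type that supplies tp_compare, call it directly
	   instead of simulating cmp via a series of richcmp calls. */
	if (v->ob_type == w->ob_type && v->ob_type->tp_compare != NULL) {
		int c = (*v->ob_type->tp_compare)(v, w);
		if (c < 0 && PyErr_Occurred())
			return -1;	/* guessed:  error, exception set */
		return c;
	}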

I don't have time to uglify this code, but Python would benefit from it.

and-no-matter-what-guido-may-say<wink>-ly y'rs  - tim