[Spambayes] There Can Be Only One

Fri, 27 Sep 2002 16:46:31 -0400

[Guido van Rossum]
> Can't you just switch to Gary by default without nuking the Graham
> code?  A negative result is also a result!

I thought I covered this in the first message.  The sheer number of options
and alternatives is overwhelming to most people, and even I'm having a hard
time keeping them all straight and working.  The technical core of this
project has been about killing non-winners from the start; otherwise there is
an exponentially growing number of choices to deal with, almost all of which
have been proven not to help.  TESTING.txt has warned about that from the
start too.

> Are there considerations of database size,

No, the database is identical either way.

> classification speed,

Dominated by tokenization and I/O time.  If you set max_discriminators to a
very large value (which hasn't done anyone any good in testing, although
it's known to hurt under the Graham scheme), it's significantly slower
because the priority queue gimmick was designed for a small number.
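
For readers unfamiliar with the idea, here is a minimal sketch of a bounded
priority queue that keeps only the strongest clues.  It is an illustration
under stated assumptions, not the project's actual code: the function name
`strongest_clues`, the dict-of-probabilities input, and the sample values are
all hypothetical; only the name max_discriminators comes from the text above.

```python
import heapq

def strongest_clues(probs, max_discriminators=16):
    """Keep the max_discriminators tokens whose spam probability is
    farthest from the neutral value 0.5 (hypothetical helper)."""
    heap = []  # min-heap: the weakest retained clue is always heap[0]
    for token, prob in probs.items():
        clue = (abs(prob - 0.5), token, prob)
        if len(heap) < max_discriminators:
            heapq.heappush(heap, clue)
        elif clue > heap[0]:
            # New clue beats the weakest one kept so far: swap it in.
            heapq.heapreplace(heap, clue)
    # Strongest clue first.
    return sorted(heap, reverse=True)

# Made-up probabilities, purely for illustration.
sample = {"viagra": 0.99, "python": 0.01, "the": 0.5,
          "free": 0.9, "meeting": 0.2}
for distance, token, prob in strongest_clues(sample, max_discriminators=2):
    print(token, prob)
```

Each token costs at most one heap operation, which is O(log k) for a heap of
size k.  That is cheap when k is small, and it is one plausible reason a very
large max_discriminators would slow things down.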

> or required corpus size that could cause someone to prefer Graham?

Untested, and there are much more important things than that to test.

> (IOW are there any circumstances where Gary might be considered
> unfeasible but where Graham would work?)

I doubt it, but don't know.  This project can't make progress unless it
gives up on dead ends; there's never perfect evidence that a thing *is* a
dead end, but the evidence for this one is enough (by my lights).

> Also, the tuning you've done to Graham may be useful for people who
> reimplement Graham in another language.

I can tag the CVS repository before nuking the code.  I'm not going to
resurrect Python's samplesort hybrid either <wink>.