[Spambayes] Spambayes error

Meyer, Tony T.A.Meyer at massey.ac.nz
Wed May 21 17:48:31 EDT 2003


> I've never seen it either, but also use the Outlook client 
> (and exclusively).  I'm surprised we haven't gotten another 
> clue in all this time!

Me, too.  I'm also surprised that it doesn't show up in testing - we
don't use the Outlook client for that...

> Who's cheating <wink -- but if a client is going thru Classifier's
> learn() and unlearn() interfaces, it's easy to show that
> hamcount <= nham is a global invariant>?

It really is bizarre.  It must surely be introduced when reading or
writing the db.  I also have my suspicions about the classifier.py lines
that "account for string" ham/spam counts - the counts should always be
ints, never strings, and I think that workaround is hiding a different
problem.  (I've changed this in my local copy, but since I don't come
across the error myself, I don't think it will help.)
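
A rough sketch of the kind of sanity check this suggests, run over a
loaded database.  It assumes a plain dict mapping each word to a
(spamcount, hamcount) pair plus the nspam/nham totals; that layout, the
function name find_bad_records and the sample numbers are made up for
illustration and are not the actual Spambayes storage format.

```python
def find_bad_records(wordinfo, nspam, nham):
    """Return (word, reason) pairs for records that break the invariants.

    wordinfo is assumed to map word -> (spamcount, hamcount); this is an
    illustrative layout, not the real on-disk format.
    """
    problems = []
    for word, (spamcount, hamcount) in wordinfo.items():
        for name, count, total in (("spamcount", spamcount, nspam),
                                   ("hamcount", hamcount, nham)):
            if not isinstance(count, int):
                # Counts should always be ints, never strings.
                problems.append(
                    (word, "%s is %s, not int" % (name, type(count).__name__)))
            elif count < 0 or count > total:
                # A word cannot appear in more messages than were trained.
                problems.append(
                    (word, "%s=%d outside 0..%d" % (name, count, total)))
    return problems


# Made-up sample data: one count exceeds the total, one is a string.
sample = {
    "the": (860909, 12),
    "proto:http": (40, "3"),
    "header:to:1": (50, 20),
}
problems = find_bad_records(sample, nspam=1000, nham=500)
for word, reason in problems:
    print(word, reason)
```

Reporting string counts rather than silently converting them should make
it easier to see where they come from.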

> Hold on.  If the maximum record spamcount is 860909, that 
> means Richard trained on 860,909 distinct spam messages all 
> of which contained a common word.  That's very hard to 
> believe.  Sounds more like the database has gone out to lunch.

You know, that's a good point... :)  Perhaps I should have thought about
the numbers instead of just copying them...

Looking at the word list, there are a lot of words with the same count
(1296, for example).  That might indicate that messages have been
trained a *lot* of times.  The words at the top are those that you would
expect to be there, however - "the", "you", "header:subject:1",
"header:to:1", "header:from:1", "proto:http", and so on.  And the count
does go down to single figures (no hapaxes, however).
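
A quick way to look for this pattern is to tally how many words share
each count.  This is only a sketch: it assumes the spam counts have
already been pulled out into a plain dict of word -> spamcount, and
apart from 1296 the sample values are invented.

```python
from collections import Counter


def count_histogram(spamcounts):
    """Map each count value to the number of words that have it."""
    return Counter(spamcounts.values())


def has_hapaxes(spamcounts):
    """True if any word was seen in exactly one trained message."""
    return any(count == 1 for count in spamcounts.values())


# Invented sample: several words stuck on the same count.
sample_counts = {
    "the": 1296,
    "you": 1296,
    "header:subject:1": 1296,
    "proto:http": 640,
    "rarely": 7,
}
hist = count_histogram(sample_counts)
most_shared = hist.most_common(1)   # [(count, number_of_words)]
print(most_shared, has_hapaxes(sample_counts))
```

Many words sharing one large count, together with no hapaxes at all, is
consistent with the same messages being trained repeatedly, though it
doesn't prove it.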

So, Richard, if you really have trained on at least 860,909 unique spam
messages, things may be OK, but if not you might want to retrain your db.

=Tony Meyer


