[spambayes-dev] Pickle vs DB inconsistencies

Fri Jun 13 01:16:27 EDT 2003

[Greg Ward, wrestling with mysteries]

I'd presume that the pickle code is correct.  What you're seeing is
consistent with that (the scores appear the same after untraining and
retraining when using the pickle, but don't when using the DB).  I suspect
there's something wrong with the DB code, as for months we've gotten reports
of odd bugs from DB users that nobody using the pickle code has reported.
The storage format matters less than the fact that the pickle code uses a plain
Python dict *during* training and scoring -- at the start, that code was so
simple it was obviously correct.  It all got a lot more complicated, via
layers of indirection, to cater to the DB backend, but the runtime dict is
still a lot simpler than building a funky cache by hand on top of an
external DB.  I still haven't tried the DB code.

For better clues, replace your %.3f and %.2f formats with %.17g, i.e. print
values to full machine precision (or print repr(some_float) -- close to the
same thing).  Then we don't have to guess whether values are "just close",
we can see whether they are (or aren't) in fact identical.
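To illustrate why the short formats hide exactly what we need to see, here is a small sketch.  The two floats are made up for the demonstration (they aren't classifier scores), and show() is just a hypothetical helper.  Note that in current Pythons repr() gives the shortest string that round-trips rather than 17 digits, but either one is enough to tell two distinct floats apart.

```python
# Two floats that differ only in the last bit or so.  These are
# made-up values for illustration, not real classifier scores.
a = 0.1 + 0.2
b = 0.3

def show(x):
    """Hypothetical helper: format one float three different ways."""
    return "%.3f" % x, "%.17g" % x, repr(x)

short_a, full_a, repr_a = show(a)
short_b, full_b, repr_b = show(b)

# The 3-digit format makes the two values look identical ...
print(short_a, short_b)
# ... while 17 significant digits (or repr) shows they are not.
print(full_a, full_b)
print(repr_a, repr_b)
print("identical?", a == b)
```

Seventeen significant digits is enough to reproduce any IEEE-754 double exactly, so if two %.17g strings match, the floats are the same.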

Also, as Tony said, if you train on just a couple of messages, it will be
straightforward to Pronounce on exactly what should have happened, down to
the 17th digit.  That would tell us for sure which scheme is hosed, and then
digging will reveal how.
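
Here is a rough sketch of the kind of tiny experiment I mean.  ToyCounts is a hypothetical stand-in built on a plain dict, not the spambayes classifier or its API, and its spamprob() is a deliberately crude ratio rather than the real scoring formula.  The point is only the shape of the check: train on two messages, record the numbers at full precision, untrain and retrain one message, and compare for exact equality.

```python
class ToyCounts:
    """Hypothetical dict-backed word counter (NOT the spambayes API)."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.wordinfo = {}  # word -> [spamcount, hamcount]

    def _update(self, words, is_spam, delta):
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        idx = 0 if is_spam else 1
        for w in set(words):
            rec = self.wordinfo.setdefault(w, [0, 0])
            rec[idx] += delta
            if rec == [0, 0]:
                # Drop words nobody has seen any more.
                del self.wordinfo[w]

    def train(self, words, is_spam):
        self._update(words, is_spam, +1)

    def untrain(self, words, is_spam):
        self._update(words, is_spam, -1)

    def spamprob(self, word):
        # Crude illustrative ratio, an assumption for this sketch only.
        spam, ham = self.wordinfo.get(word, (0, 0))
        spamratio = spam / max(self.nspam, 1)
        hamratio = ham / max(self.nham, 1)
        if spamratio + hamratio == 0:
            return 0.5
        return spamratio / (spamratio + hamratio)


def snapshot(c):
    """Record every word's score so two states can be compared exactly."""
    return {w: c.spamprob(w) for w in sorted(c.wordinfo)}


spam_msg = ["cheap", "pills"]
ham_msg = ["meeting", "pills"]

c = ToyCounts()
c.train(spam_msg, True)
c.train(ham_msg, False)
before = snapshot(c)

c.untrain(spam_msg, True)
c.train(spam_msg, True)
after = snapshot(c)

for w in before:
    print("%-8s %.17g %.17g" % (w, before[w], after[w]))
print("identical?", before == after)
```

With only two messages the expected counts can be worked out on paper, so any store that gives different numbers after the round trip is the one to dig into.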