[Spambayes] Corpus modules

Wed Nov 13 16:59:50 2002

In message:  <B9F80E7C.5C779%francois.granger@free.fr>
             <francois.granger@free.fr> writes:
>
>I was thinking of hacking the DB mechanism to split the load between two
>databases (using anydbm) to reduce access to each one and to make them more
>accessible from outside. The scoring module needs only the second one. The
>training module would update both. I suspected that a major redesign was
>underway. Here is the proposed split.
>{'word': ['ltime',     # when this record was last modified
>          'spamcount', # of spams in which this word appears
>          'hamcount',  # of hams in which this word appears
>         ]
>}
>{'word': ['atime',     # when this record was last used by scoring(*)
>          'killcount', # of times this made it to spamprob()'s nbest
>          'spamprob',  # prob(spam | msg contains this word)
>          ]
>}
>
>A 'dirty' flag could be added to the first so that a batch update of the
>second would recalculate only the dirty records.

I am in the process of doing a very similar split, although
I've (for my private stuff) made a few simplifications:

1. I don't keep track of modification and access times.
   Nothing references them, and I'm more in favor of the
   aging methods which keep the actual wordlists for
   messages around until the message as a whole is slated
   for untraining.

2. I don't keep track of killcounts.  Again, nothing
   references them, and I really don't care which clues
   are being used a lot.

Also, when a training (or untraining) event occurs, I
completely trash the second database.  This is warranted
in most cases, since the number of spam and/or ham has
changed, and thus (almost) all the spamprobs are invalidated.
This saves us from needing a dirty flag.
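
A minimal sketch of this invalidate-on-train approach, under
stated assumptions: plain dicts stand in for the two anydbm
databases, and the class name TwoTierStore and its attributes
are hypothetical, not taken from the Spambayes code.

```python
class TwoTierStore:
    """Hypothetical stand-in for the two databases (dicts, not anydbm)."""

    def __init__(self):
        self.counts = {}  # first database: word -> [spamcount, hamcount]
        self.probs = {}   # second database: word -> cached spamprob
        self.nspam = 0    # number of spam messages trained on
        self.nham = 0     # number of ham messages trained on

    def train(self, words, is_spam, untrain=False):
        """Train (or untrain) on one message, then drop all cached spamprobs."""
        delta = -1 if untrain else 1
        index = 0 if is_spam else 1
        if is_spam:
            self.nspam = max(0, self.nspam + delta)
        else:
            self.nham = max(0, self.nham + delta)

        # Each distinct word counts once per message.
        for word in set(words):
            rec = list(self.counts.get(word, [0, 0]))
            rec[index] = max(0, rec[index] + delta)
            if rec == [0, 0]:
                self.counts.pop(word, None)  # no evidence left for this word
            else:
                self.counts[word] = rec

        # The message totals changed, so (almost) every cached spamprob
        # is stale. Emptying the whole cache replaces a per-record dirty flag.
        self.probs.clear()
```

Clearing the cache costs little during training; the price is
paid later, when scoring has to recompute each spamprob the
first time it is needed.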

As I score messages, I fetch each word's spamprob from the
second database; if it isn't there, I compute it from the
counts in the first database and store it.  (If the word isn't
in the first database either, I just use the unknown word
probability and don't bother storing it in the second database.)
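
The lookup order described above can be sketched as follows.
This is an illustration only: dicts stand in for the two
databases, the function names are hypothetical, the 0.5
unknown-word value is an assumption, and a simple count ratio
stands in for the real spamprob computation.

```python
UNKNOWN_WORD_PROB = 0.5  # assumed value for words never seen in training


def compute_spamprob(spamcount, hamcount, nspam, nham):
    """Simplified placeholder formula, not the actual Spambayes one."""
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0
    total = spamratio + hamratio
    if total == 0:
        return UNKNOWN_WORD_PROB
    return spamratio / total


def lookup_spamprob(word, counts, probs, nspam, nham):
    """Return the spamprob for word, filling the cache on a miss.

    counts: first database, word -> [spamcount, hamcount]
    probs:  second database, word -> cached spamprob
    """
    # 1. Cheap path: the spamprob is already in the second database.
    if word in probs:
        return probs[word]

    # 2. Unknown word: use the default and do not store it.
    rec = counts.get(word)
    if rec is None:
        return UNKNOWN_WORD_PROB

    # 3. Cache miss on a known word: compute from the counts, then store.
    prob = compute_spamprob(rec[0], rec[1], nspam, nham)
    probs[word] = prob
    return prob
```

Not storing unknown words keeps the second database from
filling up with entries that all hold the same default value.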

Initial tests show a 4% speed hit on large batch training
and testing.  On the other hand, it speeds up the 'score
one, train one' runs immensely.

I've got a few bugs yet, and it's rather intrusive...
which is why I haven't checked it in.

- Alex