[spambayes-dev] Idea for multi-user spambayes

Tim Peters tim.peters at gmail.com
Sun Nov 27 21:23:18 CET 2005


[Neil Schemenauer]
> I have an idea for a spambayes variation that should be more suited
> to multi-user systems.  The goal is to make the DB somewhat
> conditionalized based on recipient address.  In addition to storing
> <token>, spambayes could also save (<recipient>, <token>).  When
> scoring a message, the probability for (<recipient>, <token>) would
> be added to the evidence as well as for <token>.

Offhand I think it would make more sense to ignore <token> when a
(<recipient>, <token>) pair (for the same <token> and the given
<recipient>) is known.  For example, if a urologist trains on "penis"
as ham, it's not doing him a favor to fold in that it's spam to almost
everyone else.
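
To make the difference between the two schemes concrete, here is a minimal sketch. It is not spambayes code: evidence_keys(), the override flag, and the dict-shaped db are all made up for illustration. With override=True a known (<recipient>, <token>) pair replaces the bare <token>; with override=False both are used, as in Neil's original proposal.

```python
def evidence_keys(tokens, recipient, db, override=True):
    """Pick which DB keys contribute evidence for one message.

    Hypothetical helper: `db` is any container supporting `in`, keyed
    either by a bare token or by a (recipient, token) tuple.
    """
    keys = []
    for tok in tokens:
        pair = (recipient, tok)
        if pair in db:
            keys.append(pair)
            if override:
                # Recipient-specific evidence is known: skip the global token.
                continue
        if tok in db:
            keys.append(tok)
    return keys


# Toy database: values stand in for whatever per-key statistics are stored.
db = {
    "penis": 0.99,
    "viagra": 0.99,
    ("doc@example.org", "penis"): 0.01,  # the urologist trained this as ham
}
```

For the urologist, evidence_keys(["penis", "viagra"], "doc@example.org", db) keeps only his own ("doc@example.org", "penis") entry plus the global "viagra"; any other recipient falls back to the global "penis".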

> I'm looking at chi2_spamprob() and wondering if this is valid,
> statistics-wise.

There's really no sense in which chi2_spamprob() computes "a
probability" -- it works or it doesn't.  Heh.
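
For anyone who hasn't looked at the combining step, here is a rough sketch of Fisher-style chi-squared combining in the spirit of chi2_spamprob(). It is a simplification written from memory, not the actual spambayes source: the names chi2q() and combine() are assumptions, and it skips the underflow handling a real implementation needs for long clue lists.

```python
import math


def chi2q(x2, dof):
    """P(chi-squared with `dof` degrees of freedom >= x2), for even dof only."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def combine(probs):
    """Combine per-clue spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    # s is near 1 when the clues look spammy, h is near 1 when they look hammy.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0
```

Nothing in this cares where a clue's probability came from, so (<recipient>, <token>) pairs can be fed in just like plain tokens. Whether the result is any good is an empirical question.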

> Is there some better way to include the (<recipient>, <token>) evidence?

Test some ;-)

> BTW, if this idea actually works, using (<sender>, <token>) may also
> be helpful.

Spam sender addresses typically change rapidly, while ham sender
addresses typically don't.  So I expect this would give major boosts to
tokens from known ham senders, and typically create a ton of hapaxes
(pairs seen only once in training) from spam senders.
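
A toy illustration of that expectation, under made-up data: one stable ham sender and a spam sender address that changes with every message. pair_counts() and hapaxes() are hypothetical helpers, not part of spambayes.

```python
from collections import Counter


def pair_counts(messages):
    """Count how many messages each (sender, token) pair appears in."""
    counts = Counter()
    for sender, tokens in messages:
        for tok in set(tokens):  # count each pair once per message
            counts[(sender, tok)] += 1
    return counts


def hapaxes(counts):
    """Return the pairs seen in exactly one message."""
    return [key for key, n in counts.items() if n == 1]


# Invented training data: same ham sender three times, three distinct spam senders.
ham = [("alice@example.org", ["meeting", "lunch"])] * 3
spam = [("x%d@example.net" % i, ["pills", "cheap"]) for i in range(3)]
counts = pair_counts(ham + spam)
```

Here the ham pairs accumulate counts, while every spam pair is seen exactly once, so the spam side of the database grows without ever reinforcing anything.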

