[spambayes-dev] Mozilla SpamBayes "porting"
Miguel
miguel at vargas.com
Wed Feb 18 17:21:39 EST 2004
Sorry for the semi-offtopic post, but any help will be very apreciated.
Mozilla's mail apps currently use Paul Graham's original algorithm with some basic tokenizing techniques. This
situation could use some improvement, so now there is an effort to copy some ideas from Spambayes into Mozilla.
I wrote a Mozilla patch that tries to port the chi2-combining techniques from classifier.py into Mozilla's C++. My
testing is showing huge improvements in the fn rates, but a big deterioration in the fp rates. For example, in a test
with a 3,741 email corpus we got:
original - fn:206 fp:0
chi2 patch - fn:63 fp:11
My question is, did you guys notice a similar increase in fp rates when you originally switched from Graham to chi2? If
not, then I'll assume that I made a mistake in porting classifier.py.
Many thanks,
Miguel
PS.
If anyone is interested in what Mozilla is doing, you can look here:
http://bugzilla.mozilla.org/show_bug.cgi?id=181534
http://bugzilla.mozilla.org/show_bug.cgi?id=230093
http://bugzilla.mozilla.org/show_bug.cgi?id=231873
Here is the core of my C++ port if anyone wants to take a look. You'll notice that I included the
"experimental_ham_spam_imbalance_adjustment", could this be my problem?
double spam2ham = dmin(nbad/ngood, 1.0);
double ham2spam = dmin(ngood/nbad, 1.0);
/** This section comes from probability(self, record) and _getclues(self, wordstream)**/
for (i = 0; i < count; ++i) {
Token& token = tokens[i];
// tokens is an array of Token, elements of a Token
// include both token.mProbability and token.mDistance
const char* word = token.mWord;
Token* t = mGoodTokens.get(word);
double hamcount = ((t != NULL) ? t->mCount : 0);
t = mBadTokens.get(word);
double spamcount = ((t != NULL) ? t->mCount : 0);
prob = (spamcount / nbad) / ( hamcount / ngood + spamcount / nbad);
double n = hamcount * spam2ham + spamcount * ham2spam;
prob = (0.225 + n * prob) / (.45 + n);
double distance = abs(prob - 0.5);
if (distance >= .1) {
goodclues++;
token.mDistance = distance;
token.mProbability = prob;
} else {
token.mDistance = -1; //ignore clue
}
}
// sort the array by the token distances
PRUint32 first, last = count;
if (count > 150) {
first = count - 150;
// This function sorts the array by token.mDistance
NS_QuickSort(tokens, count, sizeof(Token), compareTokens, NULL);
} else {
first = 0;
}
/** This section comes from chi2_spamprob(self, wordstream, evidence=False) **/
double H = 1.0, S = 1.0, Hexp = 0, Sexp = 0;
goodclues=0;
int e;
for (i = first; i < last; ++i) {
if (tokens[i].mDistance != -1) {
goodclues++;
double value = tokens[i].mProbability;
S *= (1.0 - value);
H *= value;
if ( S < 1e-200 ) {
S = frexp(S, &e);
Sexp += M_E;
}
if ( H < 1e-200 ) {
H = frexp(H, &e);
Hexp +=M_E;
}
}
}
S = log(S) + Sexp * M_LN2;
H = log(H) + Hexp * M_LN2;
if (goodclues>0) {
S = 1.0 - chi2Q(-2.0 * S, 2 * goodclues);
H = 1.0 - chi2Q(-2.0 * H, 2 * goodclues);
prob = (S-H +1.0) / 2.0;
} else {
prob = 0.5;
}
PRBool isJunk = (prob >= 0.90); //hardcoded at .9
------------------------------------
Here's the chi2Q funcition:
static double chi2Q (double x2, double v) {
PRUint32 i;
double m = x2 / 2.0;
double sum = exp(-m);
double term = exp(-m);
for (i=1;i<=floor(v/2);i++) {
term *= m / i;
sum += term;
}
return dmin(sum,1.0);
}
More information about the spambayes-dev
mailing list