[Spambayes] Chi-squared perl port problems
Matt Sergeant
msergeant@startechgroup.co.uk
Thu Nov 7 14:21:11 2002
[Moderators - sent this from the wrong address. Please kill that mail]
OK, I've tried to convert your chi-squared stuff to Perl, but for some
reason it's producing bizarre results. It always scores low. And I have
no idea why, because I thought I'd copied the code pretty much verbatim
(albeit adding in a few $'s and {}'s ;-)
First of all, here are the token scores that the email in question gets:
received-ip:207.230.250.119 => 1.00000
AWESOME => 1.00000
BUKAKKE => 1.00000
GALLERIES! => 1.00000
received-ip:218.53.86.224 => 1.00000
orgasmic => 1.00000
jism => 1.00000
barrages => 1.00000
href=http://205.197.95.39/users/belinda/bukkakehouse/index.html => 1.00000
href=http://205.197.95.39/remove.php => 1.00000
border=20 => 1.00000
bukakke => 1.00000
from:<hotchicks@ibm.com> => 1.00000
color=#FFCC33 => 1.00000
color=#FFCC99 => 1.00000
from:"Carol" => 1.00000
size=+3 => 0.91967
content-type:text/html => 0.90484
size=+4 => 0.89516
bgcolor=#000000 => 0.89012
size=+2 => 0.88557
instant => 0.79354
align=center => 0.78635
access! => 0.77353
color=#FFFFFF => 0.77167
remove => 0.76071
width=600 => 0.75813
now! => 0.75306
click => 0.66231
bukkake => 0.59412
20123 => 0.59412
faces! => 0.59412
color=#FF6600 => 0.55386
here => 0.46364
face=verdana => 0.44700
yourself => 0.43374
action. => 0.35701
bordercolor=Black => 0.32793
blow => 0.28531
stop => 0.25280
japanese => 0.14122
drenched => 0.10872
facial => 0.07652
please => 0.07193
drinking => 0.06229
house => 0.04293
The score my chi-squared code gives this email is 0.331736284189509,
which to me is obviously incorrect (if you pass it through Paul
Graham's method it scores 1.0).
So here's the code I'm using:
if (1) {
    # Chi-Squared method. Produces mostly boolean $result
    # but with a grey area.
    my ($H, $S);
    my ($Hexp, $Sexp);
    $H = $S = 1.0;
    $Hexp = $Sexp = 0;
    my $num_clues = @sorted;
    foreach my $row (@sorted) {
        $S *= 1.0 - $row->[PROB];
        $H *= $row->[PROB];
        if ($S < 1e-200) {
            my $e;
            ($S, $e) = frexp($S);
            $Sexp += $e;
        }
        if ($H < 1e-200) {
            my $e;
            ($H, $e) = frexp($H);
            $Hexp += $e;
        }
    }
    $S = log($S) + $Sexp + LN2;
    $H = log($H) + $Hexp + LN2;
    if ($num_clues) {
        $S = 1.0 - chi2q(-2.0 * $S, 2 * $num_clues);
        $H = 1.0 - chi2q(-2.0 * $H, 2 * $num_clues);
        $result = (($S - $H) + 1.0) / 2.0;
    }
    else {
        $result = 0.5;
    }
}
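One thing worth checking in the block above is the rescaling step. frexp splits a number into a mantissa and a power-of-two exponent, so that x = mantissa * 2**e, and therefore ln(x) = ln(mantissa) + e * ln(2). A minimal sketch of that identity follows, assuming POSIX::frexp and a locally defined LN2 constant (the post does not show how LN2 is defined); the helper name log_product is made up for illustration. Inputs are assumed to lie strictly between 0 and 1, since Perl's log() dies on zero.

```perl
use strict;
use warnings;
use POSIX qw(frexp);

use constant LN2 => log(2);   # assumed definition, not shown in the post

# Hypothetical helper: natural log of a product of probabilities,
# rescaled with frexp so the running product never underflows.
sub log_product {
    my @probs = @_;
    my ($prod, $exp) = (1.0, 0);
    for my $p (@probs) {
        $prod *= $p;
        if ($prod < 1e-200) {
            my $e;
            ($prod, $e) = frexp($prod);   # $prod becomes the mantissa
            $exp += $e;                   # accumulate the power of two
        }
    }
    # ln(mantissa * 2**exp) = ln(mantissa) + exp * ln(2)
    return log($prod) + $exp * LN2;
}

# 0.01**200 underflows to 0 as a plain double, but its log is well defined.
my @probs = (0.01) x 200;
printf "rescaled: %.4f\n", log_product(@probs);
printf "direct:   %.4f\n", 200 * log(0.01);
```

The two printed values should agree closely. Note that the exponent is multiplied by LN2 here, whereas the posted code adds LN2 to it.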
And here's the chi2q routine, if that's relevant:
# Chi-squared function
sub chi2q {
    my ($x2, $v) = @_;
    die "v must be even in chi2q(x2, v)" if $v & 1;
    my $m = $x2 / 2.0;
    my ($sum, $term);
    $sum = $term = exp(0 - $m);
    for my $i (1 .. ($v >> 2)) {
        $term *= $m / $i;
        $sum += $term;
    }
    return $sum < 1.0 ? $sum : 1.0;
}
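For comparison, the upper-tail probability of a chi-squared variable with an even number of degrees of freedom v has the closed form exp(-m) * (1 + m + m**2/2! + ... + m**(v/2 - 1)/(v/2 - 1)!), where m = x2/2. That is v/2 terms in total, so the loop index would run from 1 to v/2 - 1. The posted loop bound, $v >> 2, is v/4, which may be part of the discrepancy. Below is a minimal sketch of the closed form. The name chi2q_sketch is hypothetical, and this is not a confirmed copy of the Spambayes routine.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): upper-tail probability
# of a chi-squared variable with an even number of degrees of freedom $v.
sub chi2q_sketch {
    my ($x2, $v) = @_;
    die "v must be even in chi2q_sketch(x2, v)" if $v & 1;
    my $m    = $x2 / 2.0;
    my $term = exp(-$m);   # the i = 0 term
    my $sum  = $term;
    # Remaining terms i = 1 .. v/2 - 1, each built from the previous one.
    for my $i (1 .. ($v / 2) - 1) {
        $term *= $m / $i;
        $sum  += $term;
    }
    return $sum < 1.0 ? $sum : 1.0;
}

# With v = 2 only the i = 0 term survives, so this prints exp(-1).
printf "%.6f\n", chi2q_sketch(2.0, 2);   # 0.367879
```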
I also added some debugging output so that I could see the three stages
of S and H (after the loop, after the log(), and after the chi2q bit).
Here's the output from that:
S1=1e-10; H1=1.25335384490988e-12
S2=-22.3327037493805; H2=-26.7120509011492
S3=0.389722189708954; H3=0.726249621329936
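The second-stage values can be reproduced from the first-stage ones. Neither product fell below 1e-200, so both exponents stay at 0 and the logged values are simply log(x) plus LN2. A quick check, assuming LN2 is the natural log of 2 (its definition is not shown above):

```perl
use strict;
use warnings;

use constant LN2 => log(2);   # assumed definition

# Stage-1 values copied from the debugging output above.
my ($S1, $H1) = (1e-10, 1.25335384490988e-12);

# Same expression as in the posted code, with both exponents at 0.
my $S2 = log($S1) + 0 + LN2;
my $H2 = log($H1) + 0 + LN2;

printf "S2=%.4f; H2=%.4f\n", $S2, $H2;
```

These agree with the S2 and H2 lines above, so LN2 is being added once regardless of the exponent. With both exponents at 0, an exponent * LN2 term would contribute nothing.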
If you can help me at all, I would *really* appreciate it, as I honestly
can't see where your code and mine differ. Thanks!
Matt.