[Spambayes] Chi-squared perl port problems

Matt Sergeant msergeant@startechgroup.co.uk
Thu Nov 7 14:21:11 2002


[Moderators - sent this from the wrong address. Please kill that mail]

OK, I've tried to convert your chi-squared stuff to Perl, but for some
reason it's producing bizarre results. It always scores low. And I have
no idea why, because I thought I'd copied the code pretty much verbatim
(albeit adding in a few $'s and {}'s ;-)

First of all, here are the token scores the email in question gets:

received-ip:207.230.250.119                        => 1.00000
AWESOME                                            => 1.00000
BUKAKKE                                            => 1.00000
GALLERIES!                                         => 1.00000
received-ip:218.53.86.224                          => 1.00000
orgasmic                                           => 1.00000
jism                                               => 1.00000
barrages                                           => 1.00000
href=http://205.197.95.39/users/belinda/bukkakehouse/index.html => 1.00000
href=http://205.197.95.39/remove.php               => 1.00000
border=20                                          => 1.00000
bukakke                                            => 1.00000
from:<hotchicks@ibm.com>                           => 1.00000
color=#FFCC33                                      => 1.00000
color=#FFCC99                                      => 1.00000
from:"Carol"                                       => 1.00000
size=+3                                            => 0.91967
content-type:text/html                             => 0.90484
size=+4                                            => 0.89516
bgcolor=#000000                                    => 0.89012
size=+2                                            => 0.88557
instant                                            => 0.79354
align=center                                       => 0.78635
access!                                            => 0.77353
color=#FFFFFF                                      => 0.77167
remove                                             => 0.76071
width=600                                          => 0.75813
now!                                               => 0.75306
click                                              => 0.66231
bukkake                                            => 0.59412
20123                                              => 0.59412
faces!                                             => 0.59412
color=#FF6600                                      => 0.55386
here                                               => 0.46364
face=verdana                                       => 0.44700
yourself                                           => 0.43374
action.                                            => 0.35701
bordercolor=Black                                  => 0.32793
blow                                               => 0.28531
stop                                               => 0.25280
japanese                                           => 0.14122
drenched                                           => 0.10872
facial                                             => 0.07652
please                                             => 0.07193
drinking                                           => 0.06229
house                                              => 0.04293

The score my chi-squared code gives this email is 0.331736284189509,
which to me is obviously incorrect (if you pass it through Paul
Graham's method it scores 1.0).
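
For comparison, here is a minimal sketch of the Graham-style combining
rule referred to above, prod(p) / (prod(p) + prod(1 - p)). The name
graham_combine is hypothetical, and the clamp to [0.01, 0.99] and the
cut to the 15 tokens farthest from 0.5 are assumptions taken from
"A Plan for Spam", not from the code in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: Graham-style combination of per-token spam
# probabilities, prod(p) / (prod(p) + prod(1 - p)).
sub graham_combine {
    my @probs = @_;

    # Assumption: clamp each probability to [0.01, 0.99] so that no
    # single token can zero out either product.
    my @clamped = map { $_ < 0.01 ? 0.01 : $_ > 0.99 ? 0.99 : $_ } @probs;

    # Assumption: keep only the 15 tokens farthest from the neutral 0.5.
    my @top = sort { abs($b - 0.5) <=> abs($a - 0.5) } @clamped;
    splice(@top, 15) if @top > 15;

    my ($spam, $ham) = (1.0, 1.0);
    for my $p (@top) {
        $spam *= $p;
        $ham  *= 1.0 - $p;
    }
    return $spam / ($spam + $ham);
}

# Sixteen tokens at 1.0 plus a few milder ones, roughly like the list above.
my @example = ((1.0) x 16, 0.9, 0.5, 0.04);
printf "graham score = %.6f\n", graham_combine(@example);
```

With that many tokens pinned at the top of the range, the product form
saturates, which is consistent with the 1.0 reported for this email.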

So here's the code I'm using:

      if (1) {
          # Chi-Squared method. Produces mostly boolean $result
          # but with a grey area.
          my ($H, $S);
          my ($Hexp, $Sexp);
          $H = $S = 1.0;
          $Hexp = $Sexp = 0;

          my $num_clues = @sorted;

          foreach my $row (@sorted) {
              $S *= 1.0 - $row->[PROB];
              $H *= $row->[PROB];
              if ($S < 1e-200) {
                  my $e;
                  ($S, $e) = frexp($S);
                  $Sexp += $e;
              }
              if ($H < 1e-200) {
                  my $e;
                  ($H, $e) = frexp($H);
                  $Hexp += $e;
              }
          }

          $S = log($S) + $Sexp + LN2;
          $H = log($H) + $Hexp + LN2;

          if ($num_clues) {
              $S = 1.0 - chi2q(-2.0 * $S, 2 * $num_clues);
              $H = 1.0 - chi2q(-2.0 * $H, 2 * $num_clues);

              $result = (($S - $H) + 1.0) / 2.0;
          }
          else {
              $result = 0.5;
          }
      }
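
One point worth checking in the listing above: frexp splits a number x
into a mantissa m and a binary exponent e with x = m * 2**e, so the
natural log is recovered as ln(m) + e * ln(2). The accumulated exponent
is therefore multiplied by LN2, not added to it. A minimal sketch of
that bookkeeping follows; log_product is a hypothetical name, and LN2
is assumed to be log(2) as in the code above.

```perl
use strict;
use warnings;
use POSIX qw(frexp);

use constant LN2 => log(2);

# Hypothetical helper: natural log of a product of many small positive
# factors. The binary exponent is carried separately so the running
# product never underflows to zero.
sub log_product {
    my @factors = @_;          # assumed strictly positive
    my ($mant, $exp) = (1.0, 0);
    for my $f (@factors) {
        $mant *= $f;
        if ($mant < 1e-200) {
            my $e;
            # After this call $mant is in [0.5, 1) and the old value
            # equals $mant * 2**$e.
            ($mant, $e) = frexp($mant);
            $exp += $e;
        }
    }
    # ln(mant * 2**exp) = ln(mant) + exp * ln(2)
    return log($mant) + $exp * LN2;
}

# 300 factors of 1e-3 would underflow as a plain product (1e-900).
printf "log product = %.6f\n", log_product((1e-3) x 300);
printf "expected    = %.6f\n", 300 * log(1e-3);
```

When no rescaling happens the exponent stays 0 and the correction term
vanishes, so the result is just the log of the plain product.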

And here's the chi2q routine, if that's relevant:

# Chi-squared function
sub chi2q {
      my ($x2, $v) = @_;

      die "v must be even in chi2q(x2, v)" if $v & 1;
      my $m = $x2 / 2.0;
      my ($sum, $term);
      $sum = $term = exp(0 - $m);
      for my $i (1 .. ($v >> 2)) {
          $term *= $m / $i;
          $sum += $term;
      }
      return $sum < 1.0 ? $sum : 1.0;
}
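
For reference when comparing loop bounds: for an even number of degrees
of freedom v, the chi-squared upper tail at x2 is
exp(-m) * (1 + m + m**2/2! + ... + m**(v/2 - 1)/(v/2 - 1)!) with
m = x2 / 2, so the series has v/2 terms and the loop index runs from 1
to v/2 - 1. Note that $v >> 2 is v/4, not v/2. A minimal sketch under
that reading; chi2q_ref is a hypothetical name, not the routine from
the original code.

```perl
use strict;
use warnings;

# Hypothetical reference version: upper-tail probability of a
# chi-squared variable with even degrees of freedom $v, evaluated at $x2.
sub chi2q_ref {
    my ($x2, $v) = @_;
    die "v must be even in chi2q_ref(x2, v)" if $v & 1;

    my $m    = $x2 / 2.0;
    my $term = exp(-$m);              # the i = 0 term
    my $sum  = $term;
    for my $i (1 .. $v / 2 - 1) {     # terms i = 1 .. v/2 - 1
        $term *= $m / $i;
        $sum  += $term;
    }
    # Guard against rounding pushing the sum slightly above 1.
    return $sum < 1.0 ? $sum : 1.0;
}

# Closed forms to check against: v = 2 gives exp(-m), v = 4 gives exp(-m) * (1 + m).
printf "v=2: %.6f vs %.6f\n", chi2q_ref(2, 2), exp(-1);
printf "v=4: %.6f vs %.6f\n", chi2q_ref(4, 4), 3 * exp(-2);
```

The two closed-form cases make a quick sanity check for any port of the
routine, since they do not depend on the rest of the classifier.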

I also added some debugging output so that I could see the three stages
of S and H (after the loop, after the log(), and after the chi2q bit).
Here's the output from that:

S1=1e-10; H1=1.25335384490988e-12
S2=-22.3327037493805; H2=-26.7120509011492
S3=0.389722189708954; H3=0.726249621329936
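
The three stages can be replayed from the printed stage-1 values, which
helps narrow down where the numbers go astray. The sketch below copies
the chi2q routine exactly as posted and makes two assumptions: that
$Sexp and $Hexp were still 0 (no frexp rescaling was triggered), and
that the clue count is 46, the number of token scores listed above.

```perl
use strict;
use warnings;

# chi2q exactly as posted above, repeated so this check is self-contained.
sub chi2q {
    my ($x2, $v) = @_;
    die "v must be even in chi2q(x2, v)" if $v & 1;
    my $m = $x2 / 2.0;
    my ($sum, $term);
    $sum = $term = exp(0 - $m);
    for my $i (1 .. ($v >> 2)) {
        $term *= $m / $i;
        $sum += $term;
    }
    return $sum < 1.0 ? $sum : 1.0;
}

my ($S1, $H1) = (1e-10, 1.25335384490988e-12);  # stage 1, from the debug output
my $n = 46;                                      # assumption: one clue per listed token

# Stage 2 as posted, with the exponent terms assumed to be 0.
my $S2 = log($S1) + 0 + log(2);
my $H2 = log($H1) + 0 + log(2);

# Stage 3 as posted.
my $S3 = 1.0 - chi2q(-2.0 * $S2, 2 * $n);
my $H3 = 1.0 - chi2q(-2.0 * $H2, 2 * $n);
my $result = (($S3 - $H3) + 1.0) / 2.0;

printf "S2=%.10f; H2=%.10f\n", $S2, $H2;
printf "S3=%.10f; H3=%.10f\n", $S3, $H3;
printf "result=%.10f\n", $result;
```

If those assumptions hold, the printed values should line up with the
S2/H2 and S3/H3 lines and the final score above. In that case S2 is
ln(S1) plus a single ln(2), and the series in chi2q stops at
92 >> 2 = 23.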

If you can help me at all, I would *really* appreciate it, as I honestly
can't see where your code and mine differ. Thanks!

Matt.



