[Spambayes] Database reduction

Tim Peters tim.one@comcast.net
Thu Oct 31 03:47:35 2002


[Tim]
> ...
> A cheesy but probably-effective trick is to pick an integer N for
> all time, and index a wordinfo structure by hash(S) % N instead of by S.

FYI, if you want to pursue this, here's a start (there's not much to it if
you just want to see what happens):

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.45
diff -c -u -r1.45 classifier.py
--- classifier.py	27 Oct 2002 17:11:00 -0000	1.45
+++ classifier.py	31 Oct 2002 03:33:40 -0000
@@ -40,6 +40,9 @@

 PICKLE_VERSION = 1

+def HashSet(words):
+    return [n % 100003 for n in map(hash, Set(words))]
+
 class WordInfo(object):
     __slots__ = ('atime',     # when this record was last used by scoring(*)
                  'spamcount', # # of spams in which this word appears
@@ -320,11 +323,11 @@
             # adjustment following keeps them in a sane range, and one
             # that naturally grows the more evidence there is to back up
             # a probability.
-            hamcount = record.hamcount
+            hamcount = min(record.hamcount, nham)
             assert hamcount <= nham
             hamratio = hamcount / nham

-            spamcount = record.spamcount
+            spamcount = min(record.spamcount, nspam)
             assert spamcount <= nspam
             spamratio = spamcount / nspam

@@ -397,7 +400,7 @@
         wordinfo = self.wordinfo
         wordinfoget = wordinfo.get
         now = time.time()
-        for word in Set(wordstream):
+        for word in HashSet(wordstream):
             record = wordinfoget(word)
             if record is None:
                 record = wordinfo[word] = WordInfo(now)
@@ -419,7 +422,7 @@
             self.nham -= 1

         wordinfoget = self.wordinfo.get
-        for word in Set(wordstream):
+        for word in HashSet(wordstream):
             record = wordinfoget(word)
             if record is not None:
                 if is_spam:
@@ -440,7 +443,7 @@

         wordinfoget = self.wordinfo.get
         now = time.time()
-        for word in Set(wordstream):
+        for word in HashSet(wordstream):
             record = wordinfoget(word)
             if record is None:
                 prob = unknown
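
A note on the two min() clamps above:  HashSet returns a list built
from a Set, so two distinct words in a single message can fold into
the same bucket and bump that record's count twice for one message.
That can push a count past nham or nspam, which the old asserts would
have choked on.  Here's a self-contained sketch of the effect (which
pair of tokens collides depends entirely on the platform's hash()):

    from collections import Counter

    N = 100003

    def hash_set(words):
        # Same folding as HashSet above:  distinct words can land in
        # the same bucket, so the result may contain duplicates.
        return [hash(w) % N for w in set(words)]

    def find_collision():
        # Grind out two distinct tokens sharing a bucket; guaranteed
        # to finish within N+1 tries by the pigeonhole principle.
        seen = {}
        i = 0
        while True:
            w = "token%d" % i
            b = hash(w) % N
            if b in seen:
                return seen[b], w
            seen[b] = w
            i += 1

    w1, w2 = find_collision()
    buckets = hash_set([w1, w2])   # one "message" with two words
    print(Counter(buckets))        # one bucket id, counted twice

The min() calls just cap the damage so hamratio and spamratio can't
exceed 1.0.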

Since N is 100003 there, no more than 100003 "words" can exist in the
database.  On my large c.l.py test, about 325,000 unique words exist, so at
least 225,000 of them get folded into buckets shared with other words.
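
For a rough sense of how much folding that is, assuming hash() spreads
the words uniformly across the buckets (325,000 and 100,003 are the
figures above):

    N = 100003   # buckets
    W = 325000   # unique words in the c.l.py test

    # expected number of distinct buckets actually hit
    occupied = N * (1 - (1 - 1.0/N) ** W)
    print(int(occupied))      # ~96,100 buckets in use
    print(float(W) / N)       # ~3.25 words per bucket on average
    print(int(W - occupied))  # ~228,900 words sharing a bucket

Accuracy does suffer: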

filename:       cv    tcap
ham:spam:  20000:14000
                   20000:14000
fp total:        2       4
fp %:         0.01    0.02
fn total:        0       1
fn %:         0.00    0.01
unsure t:       97     179
unsure %:     0.29    0.53
real cost:  $39.40  $76.80
best cost:  $26.80  $58.20
h mean:       0.26    0.42
h sdev:       2.89    3.39
s mean:      99.94   99.70
s sdev:       1.44    3.20
mean diff:   99.68   99.28
k:           23.02   15.07

although the distros remain highly skewed:

-> <stat> Ham scores for all runs: 20000 items; mean 0.42; sdev 3.39
-> <stat> min 0; median 2.4147e-006; max 100
-> <stat> percentiles: 5% 2.22045e-014; 25% 1.57802e-009; 75% 0.000878141; 95% 0.588783

-> <stat> Spam scores for all runs: 14000 items; mean 99.70; sdev 3.20
-> <stat> min 17.485; median 100; max 100
-> <stat> percentiles: 5% 99.9864; 25% 100; 75% 100; 95% 100

-> best cost for all runs: $58.20
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.435 & 0.93
->     fp 2; fn 7; unsure ham 27; unsure spam 129
->     fp rate 0.01%; fn rate 0.05%; unsure rate 0.459%
-> largest ham & spam cutoffs 0.44 & 0.93
->     fp 2; fn 7; unsure ham 27; unsure spam 129
->     fp rate 0.01%; fn rate 0.05%; unsure rate 0.459%

There's not much point digging into "what went wrong" in the new error
cases, since the list of clues is worthless; e.g., here's the list for one
of the new FPs:

Data/Ham/Set1/64316.txt
prob = 0.838786702949
prob('*H*') = 0.130012
prob('*S*') = 0.807586
prob(744) = 0.0228281
prob(34690) = 0.0918367
prob(87505) = 0.0970545
prob(91999) = 0.304589
prob(29591) = 0.328993
prob(70192) = 0.371651
prob(46915) = 0.634625
prob(60034) = 0.646331
prob(49959) = 0.648468
prob(63366) = 0.686216
prob(66610) = 0.702733
prob(25331) = 0.731237
prob(81757) = 0.747858
prob(13278) = 0.751421
prob(89046) = 0.758242
prob(13498) = 0.773519
prob(5337) = 0.779329
prob(26879) = 0.805219
prob(50301) = 0.912593
prob(26426) = 0.918411
prob(35130) = 0.943716
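
The only way back from a bucket id to actual words is brute force; a
hypothetical helper, assuming you can re-tokenize the corpus to get a
token stream (all_corpus_tokens below is a placeholder):

    def words_in_bucket(bucket, tokens, N=100003):
        # Brute-force inversion:  every token that folds into `bucket`.
        # `tokens` is whatever stream you can regenerate by running the
        # tokenizer over the corpus again -- there's no cheaper route.
        return sorted(set(w for w in tokens if hash(w) % N == bucket))

    # e.g. words_in_bucket(744, all_corpus_tokens) would list every
    # word whose evidence got merged into the prob(744) clue above.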

I can say that all the new errors were difficult cases before this change
too, and often popped in and out of my FP and FN sets over the weeks.

Have fun <wink>.