[Spambayes] Results of playing with CDB
Tim Peters
tim.one@comcast.net
Mon, 16 Sep 2002 11:25:28 -0400
[Neale Pickett]
> Sort of. About half of them use pine on my server, and the other half
> are just figuring out how to browse the web, and asking them to install
> stuff is too tall an order.
So they do have their own CPUs and disk drives? Whether it's too hard to
install a thing depends on how the installer is written; if a newbie can
click a button, they can install.
> On the plus side, their inboxes are likely to be very jargon-free.
The lack of jargon is likely to hurt more than help -- the classifier gets
as much good out of finding "good words" as "bad" ones, but the set of good
words likely varies across users.
> My end goal is a centralized classifier for an entire organization on an
> embedded device. At $FIRM, our current-generation devices have between
> 4MB and 16MB of flash and no hard drive. The next ones we make will
> have more, but I don't know what that'll be yet. This is why I'm so
> concerned about storage space, though :)
I'll confess that I'm not opposed to slashing memory requirements <wink>.
> ...
> Plagued by conscience, I've just run my 1000 test hams against your
> SpamHam1.pik classifier.
Using the current code base? The current code base doesn't match that
classifier, alas -- e.g., whenever the tokenization changes, the "words"
produced can be very different. There are, for example, lots of spam words
in that database that the current tokenizer *can't* produce anymore, due to
things like changing the way URLs are tokenized.
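To make the mismatch concrete, here is a minimal sketch. Both tokenizers are hypothetical stand-ins invented for illustration, not the project's actual old or new tokenizer. It shows how a token learned under one URL scheme becomes unreachable once the scheme changes:

```python
import re

# Hypothetical "old" scheme: every whitespace-separated chunk is one token.
def old_tokenize(text):
    return text.lower().split()

URL_RE = re.compile(r"https?://([^/\s]+)(/\S*)?")

# Hypothetical "new" scheme: a URL is replaced by tagged pieces of its host.
def new_tokenize(text):
    tokens = []
    for chunk in text.lower().split():
        match = URL_RE.fullmatch(chunk)
        if match:
            host = match.group(1)
            tokens.extend("url:" + part for part in host.split("."))
        else:
            tokens.append(chunk)
    return tokens

msg = "Order now at http://cheap.example.com/buy"
trained = set(old_tokenize(msg))    # what an old database would hold
seen_now = set(new_tokenize(msg))   # what the new tokenizer can emit
orphaned = trained - seen_now       # trained tokens that can never match again
```

Here `orphaned` holds the whole-URL token: the old database still carries its score, but no message tokenized the new way will ever look it up.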
> It came back saying that 51 were spam, and 949 were ham. Not bad, eh?
Most people would find a 5% false-positive (f-p) rate unacceptable, and I've
rates over 100x smaller with the current code base and sufficient training.
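The arithmetic behind those two figures, as a small sketch. The counts are the ones quoted above; the variable names are just for illustration:

```python
false_positives = 51    # hams flagged as spam in the quoted run
total_ham = 1000        # hams tested

fp_rate = false_positives / total_ham   # 0.051, i.e. about 5%
hundredth = fp_rate / 100               # a rate 100x smaller

# How many hams you would expect to see per false positive at each rate.
hams_per_fp_now = 1 / fp_rate
hams_per_fp_hundredth = 1 / hundredth
```

At the quoted rate that is about one false positive per 20 hams; at a rate 100x smaller it is roughly one per 2000.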
> I've put the output of a "hammie.py -u" run at
>
> http://woozle.org/~neale/tmp/results.txt
Case in point <wink>: there's one f-p there containing all this stuff:
'clergy': 0.99;
'cart': 0.99;
'literature,': 0.99;
'assortment': 0.99;
'headquarters': 0.99;
'postal': 0.99;
'subject:confirmation': 0.99;
'ministry,': 0.99;
'conducting':
'ceremonies': 0.99;
'orders': 0.99;
'ordination.': 0.99;
'shopping': 0.99;
'email name:sales':
'ordination': 0.99;
'ministry.': 0.99;
'andrews': 0.99;
'baptisms': 0.99;
'booklet': 0.99;
'from:"mr.': 0.99;
'california.': 0.99;
'comply': 0.99;
'price,': 0.99;
'minister': 0.99;
'materials': 0.99;
'ordained': 0.99;
'deliveries': 0.99;
'officiating': 0.99;
'funerals,': 0.99;
'subject:Welcome': 0.99;
'shipping*': 0.99;
'rites': 0.99
Now if you've got one user who fell for a minister-by-mail scam, training
a classifier to view this as ham is going to let similar scams through to
all your users.
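A rough sketch of why that happens, assuming a simplified Graham-style per-word score clamped to [0.01, 0.99] (which is consistent with the 0.99 values listed above, but is not claimed to be the exact formula in the code base). The function name and all the counts are made up for illustration:

```python
def word_spamprob(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-word score: the spam share of a word's evidence."""
    spam_ratio = spam_hits / n_spam
    ham_ratio = ham_hits / n_ham
    if spam_ratio + ham_ratio == 0:
        return 0.5  # never-seen word: no evidence either way
    prob = spam_ratio / (spam_ratio + ham_ratio)
    return min(0.99, max(0.01, prob))  # clamp away from 0 and 1

# Hypothetical counts: 'ordination' seen in 5 of 100 spams and in no hams.
before = word_spamprob(5, 0, 100, 100)

# One user's copy of the scam is then trained as ham in the shared database.
after = word_spamprob(5, 1, 100, 101)
```

`before` sits at the 0.99 clamp, and `after` comes out lower. With a single shared database, that weakened evidence applies to every user's mail, not just the one who wanted the message.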
> All default values in my .ini file. That is to say, I have no .ini file
> :) I'll go over unread messages and see if you need any other tests run.
> Sorry I've been slackin'.
Don't use SpamHam1.pik with the current code base; I'll delete that file now
so nobody else falls into this trap.