[Spambayes] Results of playing with CDB

Mon, 16 Sep 2002 11:25:28 -0400

[Neale Pickett]
> Sort of.  About half of them use pine on my server, and the other half
> are just figuring out how to browse the web, and asking them to install
> stuff is too tall an order.

So they do have their own CPUs and disk drives?  Whether it's too hard to
install a thing depends on how the installer is written; if a newbie can
click a button, they can install.

> On the plus side, their inboxes are likely to be very jargon-free.

The lack of jargon is likely to hurt more than help -- the classifier gets
as much good out of finding "good words" as "bad" ones, but the set of good
words likely varies across users.
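To illustrate why "good words" matter as much as "bad" ones, here is a minimal sketch. It assumes a Graham-style combining rule (the one popularized by "A Plan for Spam"); the function name `combine` and the probabilities are made up for illustration and are not taken from the project's code.

```python
from math import prod

def combine(probs):
    """Graham-style combination of per-word spam probabilities (assumed rule)."""
    p = prod(probs)                      # evidence for spam
    q = prod(1.0 - x for x in probs)     # evidence for ham
    return p / (p + q)

# Hypothetical per-word spam probabilities.
spammy = [0.99, 0.99, 0.95]
# The same message, plus three strongly ham-ish words a given user's mail contains.
with_good_words = spammy + [0.01, 0.01, 0.05]

print(combine(spammy))
print(combine(with_good_words))
```

Under this rule the three ham-ish words cancel the three spammy ones, so a user whose mail has no distinctive good words gives the classifier nothing to offset a few unlucky spam words.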

> My end goal is a centralized classifier for an entire organization on an
> embedded device.  At $FIRM, our current-generation devices have between
> 4MB and 16MB of flash and no hard drive.  The next ones we make will
> have more, but I don't know what that'll be yet.  This is why I'm so
> concerned about storage space, though :)

I'll confess that I'm not opposed to slashing memory requirements <wink>.

> ...
> Plagued by conscience, I've just run my 1000 test hams against your
> SpamHam1.pik classifier.

Using the current code base?  The current code base doesn't match that
classifier, alas -- e.g., whenever the tokenization changes, the "words"
produced can be very different.  There are, for example, lots of spam words
in that database that the current tokenizer *can't* produce anymore, due to
things like changing the way URLs are tokenized.
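A minimal sketch of that mismatch follows. Both tokenizers are hypothetical stand-ins, not the project's actual old or current tokenizer; they only show how a change in URL handling strands tokens that were trained earlier.

```python
import re

def old_tokenize(text):
    # Hypothetical "old" scheme: split on whitespace only, so a URL
    # survives as one long token.
    return text.lower().split()

def new_tokenize(text):
    # Hypothetical "new" scheme: break URLs into tagged pieces.
    tokens = []
    for word in text.lower().split():
        if word.startswith("http://"):
            for piece in re.split(r"[/.]+", word[len("http://"):]):
                if piece:
                    tokens.append("url:" + piece)
        else:
            tokens.append(word)
    return tokens

msg = "Order now at http://example.com/cheap/pills"
trained = set(old_tokenize(msg))    # what a stale database would contain
produced = set(new_tokenize(msg))   # what the newer code would look up
unreachable = trained - produced    # trained tokens that can never match again

print(sorted(unreachable))
```

The whole-URL token is still in the old database, but the newer tokenizer never generates it, so that evidence is dead weight and the new `url:` pieces have no training behind them.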

> It came back saying that 51 were spam, and 949 were ham.  Not bad, eh?

Most people would find a 5% f-p (false positive) rate unacceptable, and I've
demonstrated rates over 100x smaller with the current code base and
sufficient training.
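The arithmetic behind those figures, as a small sketch. The counts come from the run quoted above; the helper name `false_positive_rate` is just for illustration.

```python
def false_positive_rate(false_positives, total_ham):
    """Fraction of ham messages wrongly flagged as spam."""
    return false_positives / total_ham

rate = false_positive_rate(51, 1000)   # 51 of 1000 hams called spam
target = rate / 100                    # a rate 100x smaller

print(f"observed f-p rate: {rate:.1%}")
print(f"100x smaller:      {target:.3%}")
```

A rate 100x smaller works out to roughly one false positive in every 2000 hams, rather than 51 in 1000.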

>  I've put the output of a "hammie.py -u" run at
>
>   http://woozle.org/~neale/tmp/results.txt

Case in point <wink>:  there's one f-p there containing all this stuff:

'clergy': 0.99;
'cart': 0.99;
'literature,': 0.99;
'assortment': 0.99;
'headquarters': 0.99;
'postal': 0.99;
'subject:confirmation': 0.99;
'ministry,': 0.99;
'conducting':
'ceremonies': 0.99;
'orders': 0.99;
'ordination.': 0.99;
'shopping': 0.99;
'email name:sales':
'ordination': 0.99;
'ministry.': 0.99;
'andrews': 0.99;
'baptisms': 0.99;
'booklet': 0.99;
'from:"mr.': 0.99;
'california.': 0.99;
'comply': 0.99;
'price,': 0.99;
'minister': 0.99;
'materials': 0.99;
'ordained': 0.99;
'deliveries': 0.99;
'officiating': 0.99;
'funerals,': 0.99;
'subject:Welcome': 0.99;
'shipping*': 0.99;
'rites': 0.99

Now if you've got one user who fell for a minister-by-mail scam, training
a classifier to view this as ham is going to let similar scams through to
all your users.

> All default values in my .ini file.  That is to say, I have no .ini file
> :)  I'll go over unread messages and see if you need any other tests run.
> Sorry I've been slackin'.

Don't use SpamHam1.pik with the current codebase; I'll delete that file now
so nobody else falls into this trap.