[Spambayes] Results of playing with CDB

Mon, 16 Sep 2002 19:30:05 -0400

[Neale Pickett]
> A jargon-free mailbox or two seems a good proving ground for whether or
> not an all-users classifier is feasible.

I'm afraid the only way to test whether an all-users classifier is usable is
to run tests on an all-users classifier.  When we have enough testers *here*
set up with reasonably trained classifiers on their own mail, we could
combine them into one classifier (classifier merge code hasn't been written
yet but is an easy task), then everyone can run their tests again with the
combined classifier.  That would give us good clues.
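
To make "easy task" a bit more concrete, here is a minimal sketch of what a
merge could look like. It assumes each classifier boils down to per-word
(spamcount, hamcount) pairs plus the number of spam and ham messages it was
trained on. The dict layout and the name merge_classifiers are invented for
this illustration; they are not the project's storage format, and no merge
code exists yet.

```python
def merge_classifiers(classifiers):
    """Combine several trained classifiers by summing their counts.

    Each classifier is assumed (hypothetically) to look like:
        {'nspam': int, 'nham': int,
         'words': {word: (spamcount, hamcount)}}
    """
    merged = {'nspam': 0, 'nham': 0, 'words': {}}
    for c in classifiers:
        # Totals of messages trained on simply add up.
        merged['nspam'] += c['nspam']
        merged['nham'] += c['nham']
        # Per-word counts add up too; unseen words start at (0, 0).
        for word, (spam, ham) in c['words'].items():
            old_spam, old_ham = merged['words'].get(word, (0, 0))
            merged['words'][word] = (old_spam + spam, old_ham + ham)
    return merged


# Two made-up users: one trained 'ordained' as ham once, the other saw
# it in five spams.
user_a = {'nspam': 10, 'nham': 20, 'words': {'ordained': (0, 1)}}
user_b = {'nspam': 500, 'nham': 300, 'words': {'ordained': (5, 0)}}
combined = merge_classifiers([user_a, user_b])
```

Under this scheme a user with a large corpus outweighs a user with a small
one, which may or may not be what an all-users classifier wants.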

>> Now if you've got one user who fell for a minister-by-mail
>> scam, training a classifier to view this as ham is going to let
>> similar scams through to all your users.

> That's exactly what it was, someone forwarded to me a message certifying
> them as a minister.  My classifier database scores it as ham (p=0.00) on
> the following words:
>
>         '$33.90': 0.01;

LOL!

>         'california,': 0.01;
>         'california!': 0.01;
>         "minister's": 0.01;
>         'divinity': 0.01;
>         'church,': 0.01;
>         'funerals,': 0.01;
>         'ministers),': 0.01;
>         'ministers).': 0.01;
>         're-ordained.': 0.01;
>         'wedding': 0.01;
>         'hardbound': 0.01;
>         'deliveries': 0.01;
>         'ordained': 0.01;
>         'pass,': 0.01;
>         'processed,': 0.01
>
> Of course, that's because I *trained* it on this message (among others).
> But what's interesting is that two of the big tip-offs from SpamHam1.pik
> ("ordained": 0.99, "funerals,": 0.99) show up as strong ham indicators
> in my database.

They likely appeared only in the message you trained it on, telling the
system it was ham.  In the very much larger spam collection I trained the
.pik file on, they appear in several spams -- but remember that you didn't
train using my spam collection.

> Doesn't 0.01 mean it's never been seen as spam?

Either that, or that it's been seen substantially more often in ham than in
spam.  Intuitively that means at least 100x more often, but that's not quite
the truth, because of the distorting effects of HAMBIAS.
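
A rough sketch of the arithmetic, for anyone who wants to see where the
0.01 comes from. This is a Graham-style per-word estimate; the function
name, the ham bias of 2.0, and the treatment of never-seen words are
assumptions made for illustration, and only the 0.01 and 0.99 endpoints
are taken from the scores quoted in this thread.

```python
def spam_probability(spamcount, hamcount, nspam, nham,
                     hambias=2.0, lo=0.01, hi=0.99):
    """Graham-style spam probability for one word (illustrative only).

    spamcount/hamcount: how many times the word was seen in spam/ham.
    nspam/nham: how many spam/ham messages were trained on.
    hambias: assumed multiplier applied to ham counts.
    """
    # Bias the ham count, but never let it exceed the number of hams.
    ham = min(hambias * hamcount, nham)
    spam = min(spamcount, nspam)
    hamratio = ham / nham
    spamratio = spam / nspam
    if hamratio + spamratio == 0:
        return 0.5  # never seen at all: assumed neutral
    prob = spamratio / (hamratio + spamratio)
    # Clamp so that no single word is ever treated as certain.
    return max(lo, min(hi, prob))


# Seen once, in ham only: clamps to the ham end.
only_ham = spam_probability(0, 1, nspam=100, nham=100)
# Seen 50x more often in ham than in spam, with and without the bias.
biased = spam_probability(1, 50, nspam=100, nham=100)
unbiased = spam_probability(1, 50, nspam=100, nham=100, hambias=1.0)
```

With these made-up counts, a word seen only once, in ham, already lands on
0.01, and with the bias a 50:1 ham-to-spam ratio is enough to hit the clamp
while the unbiased estimate stays above it.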

> I know it's not a lot of words, but I wonder if this is evidence that
> the character of people's spam is just as individual as the character of
> their ham.  That would point toward spammers doing more targeting than
> I thought they were.

I expect it's evidence that minister-by-email spams are relatively rare,
that I picked some up because my spam corpus is relatively large, and that
you didn't pick any up because yours is relatively small.  A classifier
knows nothing about the world apart from what it's been trained on, of
course.