[Spambayes] Sequential Test Results

Jim Bublitz jbublitz@nwinternet.com
Sun, 06 Oct 2002 02:42:23 -0700 (PDT)


On 06-Oct-02 Tim Peters wrote:
> Thanks for sharing this!  It's an excellent report.

Thanks for taking time to reply - I realize I'm somewhat off-topic
here.
 
>> I have a very unusual corpus of ham and spam compared
>> to "normal", so these results may not be widely
>> applicable.
 
> Could you say something about *what* makes you abnormal <wink>?

You mean besides coding Python?

Spam: over 50% is Asian language; business-related or industry-specific
spam; plus all of the other usual kinds of spam. Also a lot of virus
msgs (raw, or scrubbed/tagged by an ISP or firewall filter) - my
favorite was "I insult your mother".

Ham: over 3/4 is lists (some long) of part numbers/quantities/other
info (e.g. TMS320P14FNL  1000  99DC $10.00) or related correspondence
(quotes, RFQs, inquiries, etc.), some with a small amount of Asian
text mixed in. The rest is newsletters, mailing lists, and a small
amount of personal mail.

> Excellent -- nobody here has done that yet (that I know of), and
> I've worried out loud that randomization allows msgs to get
> benefit from training msgs that appeared *after* them in time;
> e.g., a ham msg can be helped by the fact that a reply to it
> appeared in the training ham, but that can never happen in real
> life.

It seems the opposite is true - my results were worse before (0.5%
to 1.0% failures or worse). You might have already achieved
perfection and not know it due to randomization :) It appears that
both systems learn gradually. For example, one of my ISPs started
virus filtering at a point after the initial training data, and that
produced problems in earlier tests where I trained on N msgs and then
tested the next M without any retraining. That didn't occur here.
Some other hard-to-filter msgs (again, for both methods) also didn't
fail.

 
> I'm not sure what these are results *of* -- like, the last time
> you ran step>#2?  An average over all times you ran step #2?

Total results for testing 14400 msgs in batches of 200 (training
after each batch of 200) - failures counted against a (virtual)
cutoff setting.
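
For concreteness, a rough sketch of that procedure - the classifier
object and its train()/score() methods are just illustrative names,
not the actual spambayes API:

    # Score each batch of 200 msgs using only what was trained on
    # earlier batches, count failures against the cutoff, then train
    # on the batch before moving to the next one.
    BATCH = 200

    def sequential_test(classifier, msgs, labels, cutoff=0.5):
        failures = 0
        for start in range(0, len(msgs), BATCH):
            batch = msgs[start:start + BATCH]
            batch_labels = labels[start:start + BATCH]
            for msg, is_spam in zip(batch, batch_labels):
                if (classifier.score(msg) >= cutoff) != is_spam:
                    failures += 1
            for msg, is_spam in zip(batch, batch_labels):
                classifier.train(msg, is_spam)
        return failures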


>>                           Graham
>>                        Spam    Ham
>> Mean                   0.98    0.01

> And these are the means of what?  For example, there's no
> false-negative rate as large as 0.98 in the table above, so 0.98
> certainly isn't the mean of the table entries.

Mean/std deviation for scores of all msgs tested.

>> Std Dev                0.04    0.02
>> 3 sigma                0.86    0.07
 
>> 1. Word freq threshold = 1 instead of 5
 
> That helped us a lot when we were using Graham.
 
>> 2. Case-sensitive tokenizing
 
> That did not (made no overall difference in error rates; it
> systematically called conference announcements spam, but was
> better at distinguishing spam screaming about MONEY from casual
> mentions of money in ham).

Everything made a *small* difference - I'm really quite surprised
everything lined up in the same direction for once. I went through
most of the tweaks from scratch one at a time (including some of my
own that I thought were really cool but ultimately didn't work very
well) and what's left is what worked the best. Finally having
clean samples really helped too.
 
>> 3. Use Gary Robinson's score calculation
 
> With or without artificially clamping spamprobs into [0.01, 0.99]
> first (as Graham does)?

Same as Graham. I went back and tried Graham's scoring again too,
and it's only marginally worse than Robinson's (but has the problem
of extreme values of fp & fn). My "Robinson scoring" is just the
S = (P - Q)/(P + Q) kind.
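
Spelled out, my understanding of that combining step looks roughly
like the sketch below (the [0.01, 0.99] clamp is the Graham-style
clamping mentioned above; the nth-root normalization follows Gary's
write-up, and the exact details may differ from what either of us
actually runs):

    import math

    def robinson_score(word_probs):
        # Combine per-token spam probabilities with S = (P - Q)/(P + Q).
        # Returns a score in [0, 1]; 0.5 is maximally uncertain.
        n = len(word_probs)
        if n == 0:
            return 0.5
        probs = [min(0.99, max(0.01, p)) for p in word_probs]
        P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
        Q = 1.0 - math.prod(probs) ** (1.0 / n)
        S = (P - Q) / (P + Q)
        return (1.0 + S) / 2.0      # rescale from [-1, 1] to [0, 1]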
 
>> 4. Use token count instead of msg count in computing
>> probability.
 
> We haven't tried that.

It was a programming error (wrong indentation in the token-processing
loop) that led to better results. Wish I could say I thought of it,
but it makes more sense to me now. Again, it makes a small difference
overall, but has a bigger effect on the shape of the score
distribution in my tests.
 
>> Counting msgs instead of tokens in computing probability is
>> a fairly subtle bias (noted by Graham in "A Plan for Spam")
>> and is still included in Spambayes.
 
> Not really.  We currently depart from Graham too in counting
> multiple occurrences of a word only once in both training and
> scoring.  Our hamcounts and spamcounts are counts of the # of
> messages a word appears in now, not counts of the total number
> of times the word appears in msgs (as they were
> under Graham).

Yes - that's what bothers me.
 
>> If I count msgs instead of tokens I can get about the same
>> results
>> and the mean and std dev are unaffected, but the tails of the
>> distributions for ham/spam scores move closer together (no large
>> dead band as above). Here's why (sort of):
>>
>> The probability calculation is:
>>
>> (s is spam count for a token, h is ham count, H/S are either
>> the number of msgs seen or number of tokens seen)
 
> I'm not sure what "spam count for a token" means.  For Graham, it
> means the total number of times a token appears in spam,
> regardless of msg boundaries. For us today, it means the number
> of spams in which the token appears (and "Nigeria" appearing 100
> times in a single spam adds only 1 to Nigeria's spam count for
> us; it adds 100 to Graham's Nigeria spam count).  Our error rates
> got lower when we made training symmetric with scoring in this
> respect, although that wasn't true before we purged *all* of the
> deliberate biases in Paul's scheme.

"spam count" means the same as your "spamcount" variable in
update_probabilities - you count once per msg, I count every
occurance in a msg. Making "training symmetric with scoring" is
what seems intuitively incorrect to me, along with nham/nspam being
msg counts instead of token counts. 
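
To make the two conventions concrete, a toy sketch (the names are
mine, not spambayes'): the first is the once-per-msg counting
spambayes does now, the second is the count-every-occurrence version
I'm using.

    def train_counting_msgs(counts, msg_tokens):
        # Each token counts at most once per message; the denominator
        # (nham/nspam) is a count of messages.
        for tok in set(msg_tokens):
            counts[tok] = counts.get(tok, 0) + 1
        return 1        # this msg adds 1 to the denominator

    def train_counting_tokens(counts, msg_tokens):
        # Every occurrence counts; the denominator is a count of tokens.
        for tok in msg_tokens:
            counts[tok] = counts.get(tok, 0) + 1
        return len(msg_tokens)   # this msg adds its token count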

If you see the word "fussball" in an arbitrary wordstream, is the
wordstream German or English ("football" in German, a table game
found in bars in US English)? I'd guess German, because I'm also
guessing the word occurs with greater frequency in German
wordstreams than in English wordstreams (absent context) - not
because I think more German books contain at least one occurrence of
the word than English books do. On the testing side, if the test
wordstream contained "fussball!fussball!fussball!", would you change
your guess? I'd suggest your guess would still be based on a single
occurrence - the repetition doesn't change the probability of which
set the wordstream belongs to. I can't see that it would be 3X more
likely one way or the other - what else could you conclude then but
that "fussball" and "fussball!fussball!fussball!" have identical
probabilities of being elements of a German wordstream, without some
other kind of data?

>> prob = 1/(1 + (S/H)*(2*h/s))
 
> Did you keep Graham's ham bias?  We have not.

Yes - again, a small (positive) difference.
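
As code, the quoted per-token formula (ham bias of 2 kept, with the
Graham-style clamping) comes out as the sketch below; s/h are the
token's spam/ham counts, S/H the totals (msgs or tokens, depending
on which scheme you use):

    def token_spamprob(s, h, S, H, ham_bias=2.0):
        # Equivalent to prob = 1/(1 + (S/H)*(ham_bias*h/s)); a sketch
        # only - the real code also deals with unseen words, etc.
        if s == 0 and h == 0:
            return 0.5              # never seen: no evidence either way
        spam_ratio = s / S
        ham_ratio = ham_bias * h / H
        prob = spam_ratio / (spam_ratio + ham_ratio)
        return min(0.99, max(0.01, prob))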
 
> Note that overlapping tails aren't something our default scheme
> tries to eliminate.  It's considered "a feature" here that
> Gary's scheme has a middle ground where mistakes are very likely
> to live.  This is something you learn to love <wink> after
> realizing that mistakes cannot be stopped.

Yes - and if your scores really indicate the actual probability of
spamminess, you can use that info to sort the msgs for manual
review. Given the volume of spam, fatigue is a real problem in
manual review - I wouldn't risk the possibility of fps except that
they're more likely with a manual system (as I found out in sorting
25K msgs semi-manually). I'm actually concerned that if the fp
rate is too low, there won't be enough reward in reviewing the
results manually - my fps could be very expensive. It appears to me
that perfect results are not obtainable because everyone probably
has msgs that they can't reliably bucket as spam or ham.
 
> For example, under Graham's scheme, you're *eventually* going to
> find ham that scores 1.0 (and spam that scores 0.0).  For
> example, with 15 discriminators, sooner or later you're going to
> find a ham that just happens to have 8 .99 clues and 7 .01
> clues, and then Graham is certain it's spam.

Happened a lot in other kinds of testing, but not much when testing
sequentially as described above - I have no idea why.
 
> There's no cutoff value that can save you from this kind of false
> positive, short of never calling anything spam.  When Gary's
> scheme makes a mistake, it's almost always within a short
> distance of the data's best spam_cutoff value.  In a system with
> manual human review, this is very exploitable;

Agree - I should have read ahead before writing the response above.

> in a system without manual review, I suppose you just pass such
> msgs on, but still have the *possibility* to say clearly that
> the system is known to make mistakes in this range.
 
>> Nothing I did to Spambayes had much effect on mean/std dev, but
>> did reshape the distribution curves.  I get a lot more tokens
>> than Spambayes,
 
> ?  What does that mean?  If you're using spambayes, it's
> generating tokens, so it seems hard to get a lot more than that
> <wink>.

My tokenizer consists of a findall on
re.compile(r"[\w'$_-]+", re.U). I get a lot more tokens than the
spambayes tokenizer produces. "I get" meant my Graham version vs.
spambayes.
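
In full, that's just (message decoding, headers vs. body, etc. are
assumed handled before this point):

    import re

    # One Unicode-aware findall over the message text is the whole
    # tokenizer.
    TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

    def tokenize(text):
        return TOKEN_RE.findall(text)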

 
>> 3. I'd concentrate on shaping the tails of the distribution
>> rather than worrying about mean and std dev.
 
> The so-called central-limit schemes we're investigating now are
> almost entirely about separating the tails, and *knowing* when
> we can't, so that should give you cause for hope.

I gathered that from today's list msgs - didn't notice it before.
 
> OTOH, some ham and some spam simply aren't clearcut, even for
> human judgment, so I see no hope that this can be wholly
> eliminated "even in theory".

Either method does better than I do (or thinks it does, at any rate).
 
>> the fns and fps are out past 3 sigma.  In EE terms, you want
>> sharper rolloff, not necessarily higher Q or a change in center
>> frequency. Graham appears to be less sensitive to choice of
>> cutoff than Spambayes for my dataset.
 
> This was universally observed:  the Graham score histograms
> approximated two solid bars, one at 0.0, the other at 1.0, the
> more data it was trained on. Unfortunately, its *mistakes* also
> lived on these bars.

Yes, but the (P - Q)/(P + Q) scoring fixes that nicely for my data.
 
> It would take telepathy, and even people on this list argue about
> whether specific msgs are ham or spam.

The computer is always right.
 
> I've noted before that the chance my classifier
> would produce an FP over the next year is smaller than the
> chance I'll die in that time, and I personally don't fear a
> false positive more than death <wink>.

You haven't met my wife - one persistent fp was from a woman who is
both my wife's best friend and was (and may be again) our best
customer. I suppose that's what whitelists are for.


Jim