[spambayes-dev] spammy subject lines

Mon Oct 13 11:42:53 EDT 2003

[Tony Meyer]
> My results, for the mail I am currently using in Outlook, with
> timcv.py -n3 are:
>
> -> <stat> tested 112 hams & 1506 spams against 221 hams & 2888 spams
> -> <stat> tested 96 hams & 1423 spams against 237 hams & 2971 spams
> -> <stat> tested 125 hams & 1465 spams against 208 hams & 2929 spams
> -> <stat> tested 112 hams & 1506 spams against 221 hams & 2888 spams
> -> <stat> tested 96 hams & 1423 spams against 237 hams & 2971 spams
> -> <stat> tested 125 hams & 1465 spams against 208 hams & 2929 spams
> filename:     octs   subjs
> ham:spam:  333:4394
>                    333:4394
> fp total:        1       1
> fp %:         0.30    0.30
> fn total:      165     162
> fn %:         3.76    3.69
> unsure t:      528     526
> unsure %:    11.17   11.13

That's an extraordinarily high unsure %.  Do you normally see such a high
rate?  The FN rate also seems high.

> real cost: $280.60 $277.20
> best cost: $136.20 $134.00

Suggests that the cutoffs are far from optimal.  Score distribution
histograms would reveal more.
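
For what it's worth, the "real cost" figures above are consistent with
charging $10.00 per false positive, $1.00 per false negative and $0.20 per
unsure.  Those weights are an assumption here (nothing in the output states
them), but they reproduce both totals from the fp, fn and unsure counts in
the table.  A minimal sketch of the arithmetic:

```python
# Assumed per-message costs.  They reproduce the "real cost" totals
# quoted above, but the summary does not state them, so treat them
# as an assumption about the test driver's settings.
FP_COST = 10.00      # ham scored as spam
FN_COST = 1.00       # spam scored as ham
UNSURE_COST = 0.20   # message left in the unsure range

def total_cost(fp, fn, unsure):
    """Total cost of one run given its error and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

# Counts taken from the summary table (fp total, fn total, unsure t).
octs = total_cost(fp=1, fn=165, unsure=528)
subjs = total_cost(fp=1, fn=162, unsure=526)

print("octs  real cost: $%.2f" % octs)
print("subjs real cost: $%.2f" % subjs)
```

Under the same weights, the "best cost" is what you'd get from the pair of
cutoffs that minimizes this total, which is why a best cost of about half
the real cost points at the cutoffs rather than at the classifier.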

> h mean:       1.71    1.81
> h sdev:       8.85    9.06
> s mean:      91.27   91.46
> s sdev:      22.57   22.30
> mean diff:   89.56   89.65
> k:            2.85    2.86
>
> (Note that I happily run a 1:10 ham:spam ratio; I can do tests
> without the imbalance if desired).

Are you also running the mixed unigram/bigram scheme?

> Looks like a slight win for me;

One of the points of a cross-validation run is to get several runs, and see
how many won, lost, and tied.  This very brief summary output hides all that
stuff.  So, e.g., we can't tell whether all 6 runs had a tiny win, or 5 lost
a little and 1 won big, adding up to a tiny overall win.  The smaller the
net effect in the end, the more important to see more details.
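
A rough sketch of the kind of per-run tally meant here.  The function name
and the per-run numbers are made up for illustration; they are not from the
run quoted above.  Both "after" series have the same total, yet they tell
very different stories:

```python
def tally(before, after):
    """Count runs where 'after' won, lost, or tied against 'before'.

    Lower is better (for example, per-run false negative counts).
    """
    pairs = list(zip(before, after))
    won = sum(1 for b, a in pairs if a < b)
    lost = sum(1 for b, a in pairs if a > b)
    tied = sum(1 for b, a in pairs if a == b)
    return won, lost, tied

# Hypothetical per-run fn counts, invented for illustration only.
before = [20, 22, 19, 21, 23, 20]
after_a = [19, 21, 18, 20, 22, 19]   # every run a tiny win
after_b = [21, 23, 20, 22, 24, 9]    # five small losses, one big win

for name, after in (("a", after_a), ("b", after_b)):
    won, lost, tied = tally(before, after)
    print("%s: total %d -> %d, won %d, lost %d, tied %d"
          % (name, sum(before), sum(after), won, lost, tied))
```

A summary that reports only the totals can't distinguish case a from case b.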

> I have no idea how many messages I have that do try and use this sort
> of trick in the subject, since so little spam appears out of my spam
> folder I don't really pay enough attention to it.

I know I've received spam with this kind of ob^fu!sca@tion in the Subject
line, but it doesn't look like I ever trained on one -- meaning all such
spam were identified as spam for me without this change.  I've noticed that
rescoring my trained ham after the change gives a tighter ham-score
distribution, probably because of new highly correlated tokens from mailing
lists.  For example, your message now generates a

    subject:spambayesdev

token in addition to the

    subject:spambayes

and

    subject:dev

tokens it used to generate, and they're all strong ham clues.
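
To make the effect concrete, here is a minimal sketch of one way such
tokens could arise.  It is not the actual tokenizer code: the function name
subject_tokens and the splitting rule are assumptions chosen only to
reproduce the three tokens listed above.  Each whitespace-separated word is
split into alphanumeric runs, and when a word yields more than one run, the
runs are also emitted joined together.

```python
import re

def subject_tokens(subject):
    """Yield illustrative subject tokens (hypothetical, not the real code)."""
    for word in subject.lower().split():
        # Alphanumeric runs within the word, e.g. "spambayes" and "dev".
        pieces = re.findall(r"[a-z0-9]+", word)
        for piece in pieces:
            yield "subject:" + piece
        # Extra token: the word with embedded punctuation squeezed out.
        if len(pieces) > 1:
            yield "subject:" + "".join(pieces)

for tok in subject_tokens("[spambayes-dev] spammy subject lines"):
    print(tok)
```

Under this rule an obfuscated word such as ob^fu!sca@tion would also yield
subject:obfuscation, which is the spam-side motivation, while mailing-list
tags pick up an extra, highly correlated ham clue.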