[Spambayes] python.org corpus updated

Greg Ward gward@python.net
Mon Oct 28 17:16:20 2002


On 26 October 2002, Tim Peters said:
> -> <stat> 4 new false positives
>     new fp: ['pyham/02155.txt', 'pyham/01816.txt', 'pyham/02322.txt',
>              'pyham/02406.txt']
> 
> but I believe they're all spam.  I'll attach them for your review.  They
> correspond, respectively, to your

Can't really blame SpamAssassin for missing these -- they were all sent
to a Mailman -request address, which is explicitly whitelisted on
python.org (I don't want to reject unsubscribe requests from people who
happen to be on too many RBLs).  Moved 'em to spam folder.
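
For anyone re-checking their own ham folder, here is a minimal sketch of
how messages that took the whitelisted path could be picked out for a
manual look.  The helper name request_recipients is hypothetical, and it
assumes the -request address shows up in the To or Cc header, which
won't catch envelope-only recipients.

```python
from email import message_from_string
from email.utils import getaddresses


def request_recipients(msg):
    """Return To/Cc addresses whose local part ends in '-request'."""
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    return [addr for _, addr in getaddresses(fields)
            if addr.partition("@")[0].lower().endswith("-request")]


if __name__ == "__main__":
    sample = message_from_string(
        "To: python-list-request@python.org\n"
        "Cc: someone@example.com\n"
        "\n"
        "unsubscribe\n")
    print(request_recipients(sample))
```

Anything this returns a non-empty list for skipped SpamAssassin, so its
presence in the ham folder says nothing about whether it is ham.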

> -> <stat> 9 new false positives
>     new fp: ['pyham/00277.txt', 'pyham/00278.txt', 'pyham/00275.txt',
>              'pyham/00267.txt', 'pyham/01346.txt', 'pyham/00261.txt',
>              'pyham/00276.txt', 'pyham/01284.txt', 'pyham/00645.txt']
> 
> Again I believe these are all spam, and some are so outrageously spam it's
> hard to believe SpamAssassin let them pass!  Then again, most are in a hated
> language <wink>.
> 
> ham/183BtE-00072Z-00   261
> ham/183DZB-0007dJ-00   267
> ham/183Epz-0001IH-00   275
> ham/183Epz-0001II-00   276
> ham/183Epz-0001IJ-00   277
> ham/183Epz-0001IK-00   278

These should have been dead easy: the subject is encoded in iso-2022-jp
(which is *now* a banned charset on python.org, but wasn't when this
harvest started), and they are all "To: a@a.a".  Unfortunately, while
Exim can be made very picky about addresses in sender headers ("From",
"Reply-to", "Sender"), I don't think it has anything for rigorous
checking of recipient headers.  Hmmm.
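
Outside the MTA, both signals are easy enough to test for with the
standard email package.  What follows is only a sketch under stated
assumptions: BANNED_CHARSETS, subject_charsets and bogus_recipients are
hypothetical names, and the "top-level label must be at least two
characters" rule is my own heuristic for catching things like a@a.a,
not anything Exim or SpamAssassin does.

```python
from email import message_from_string
from email.header import decode_header
from email.utils import getaddresses

# Assumption: mirrors the charset ban mentioned above; extend as needed.
BANNED_CHARSETS = {"iso-2022-jp"}


def subject_charsets(msg):
    """Charsets named by RFC 2047 encoded-words in the Subject header."""
    return {charset.lower()
            for _, charset in decode_header(msg.get("Subject", ""))
            if charset}


def bogus_recipients(msg):
    """Recipient addresses that cannot be real (hypothetical heuristic)."""
    bogus = []
    for _, addr in getaddresses(msg.get_all("To", [])):
        local, _, domain = addr.rpartition("@")
        labels = domain.split(".")
        # No local part, no dot in the domain, or a one-character
        # top-level label such as the final "a" in "a@a.a".
        if not local or len(labels) < 2 or len(labels[-1]) < 2:
            bogus.append(addr)
    return bogus
```

A message would be flagged when subject_charsets(msg) overlaps
BANNED_CHARSETS or when bogus_recipients(msg) is non-empty.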

> ham/183aCi-00024k-00   645
> ham/183ueG-0006vd-00  1284
> ham/183xNY-0008Gi-00  1346

These slipped through because they are to "-request" addresses.

> Take those away and there were no false positives in either direction.

Wow, awesome.

> One example:
> 
> spam/183UWS-00060A-00  633
> 
> seems a perfectly ordinary piece of mailman-users traffic.  chi-combining is
> quite certain it's ham:
> 
> prob = 3.37424532759e-012
> prob('*H*') = 1
> prob('*S*') = 6.63913e-012
> 
> OTOH, SpamAssassin seems certain it's spam:

Well, actually, it only scored 5.4.  SA doesn't have any formal notion
of certainty, but I'm pretty comfortable in stating that scores from 3.0
to 10.0 are the informal SA zone of uncertainty.  Blame me: I think I
forgot to manually review low-scoring messages in the spam folder for
FPs.  I'll do that before regenerating the tarballs.
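
The review pass could be narrowed to that zone with something like the
sketch below.  It assumes each message carries an X-Spam-Status header
of the form "Yes, hits=5.4 required=5.0 ..." (newer SpamAssassin
releases write "score=" instead of "hits=", so the pattern accepts
both); sa_score and needs_review are hypothetical helper names, and the
3.0 and 10.0 bounds are just the informal ones from above.

```python
import re

# Assumption: the score appears as "hits=N" or "score=N" in X-Spam-Status.
SCORE_RE = re.compile(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def sa_score(status_header):
    """Extract the numeric score from an X-Spam-Status value, or None."""
    match = SCORE_RE.search(status_header or "")
    return float(match.group(1)) if match else None


def needs_review(status_header, low=3.0, high=10.0):
    """True when the score falls in the informal zone of uncertainty."""
    score = sa_score(status_header)
    return score is not None and low <= score <= high
```

Messages with no parseable score are left out by needs_review, so they
would need a separate look.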

> There also appear to be an awful lot of "false negatives" of the form:
> 
> """
>     This is a message from the IFL E-Mail Virus Protection Service
>     --------------------------------------------------------------
> 
> The original e-mail attachment
> 
>     "Card.DOC.pif"
> 
> appears to be infected by a virus and has been replaced by this
> warning message.
> """
> 
> That may be virus fallout, but I don't believe it belongs in the spam
> corpus, right?

Correct -- I usually put all that stuff in the virus folder, because I'd
like to see all virus-related junk mail stopped, and I think it should
be done with different tools from spam detectors.  Again, my fault for
not manually reviewing the spam folder.

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
I'm on a strict vegetarian diet -- I only eat vegetarians.