python.org corpus (was Re: [Spambayes] Moving closer to Gary's ideal)

Greg Ward gward@python.net
Mon, 23 Sep 2002 12:36:32 -0400


On 23 September 2002, To SpamBayes said:
> There also
> seems to be some personal email in the ham -- not sure how that snuck
> in, since I was careful to detect and *not* save it.  Haven't
> investigated yet.

Ahh, figured it out.  There are a couple of ways personal mail can get
into the 'ham' folder of the email harvest:

  * it was flagged as junk mail by Exim on receipt, but was manually
    moved by me from the 'misc-junk' folder to the 'ham' folder on
    inspection

  * it was flagged by SpamAssassin on receipt, but again was manually
    moved by me from 'caught-spam' to 'ham'

  * it was sent to e.g. Guido@python.org instead of guido@python.org

The last one is obviously a silly bug in the harvesting code; I'll fix
that before the next harvesting run.  I'm going to delete those messages
from the ham folder now; there's no reason to keep them since there's no
other genuinely personal email in there.
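
For illustration, the case bug could be avoided along the lines of the
sketch below.  This is not the actual harvesting code: the names
PERSONAL_ADDRESSES and is_personal are hypothetical, and it assumes that
folding the whole address to lower case is acceptable (mail systems
generally treat mailbox names case-insensitively in practice, even though
the local part is formally case-sensitive).

```python
from email.utils import parseaddr

# Hypothetical list of addresses whose mail counts as personal;
# stored in lower case so comparisons can ignore case.
PERSONAL_ADDRESSES = {"guido@python.org"}


def is_personal(recipient_header):
    """Return True if the recipient is a personal address, ignoring case."""
    # parseaddr splits "Real Name <addr>" into (realname, addr)
    _, addr = parseaddr(recipient_header)
    return addr.lower() in PERSONAL_ADDRESSES
```

With this check, Guido@python.org and guido@python.org are treated as the
same recipient, so neither would end up in the ham folder.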

The other two are trickier.  We really do want to harvest spam sent to
personal addresses (mail that would count as personal email if it were
ham) -- skipping it wastes available training data.  But once I look at
that so-called spam and realize that it's a false positive, then all of
a sudden it's personal email.
*BUT* it's also email that some junk-detection system (my ad-hoc Exim
ACLs, or SpamAssassin) flagged as junk... so it's valuable to keep in
the training data.  Hmmm.  Opinions?

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
No man is an island, but some of us are long peninsulas.