[Spambayes] Moving closer to Gary's ideal

Greg Ward gward@python.net
Mon, 23 Sep 2002 10:15:19 -0400

On 23 September 2002, Guido van Rossum said:
> > One possibility is to use the spam half of the corpus I gathered last
> > week.
> How many msgs by now?

Here are the all-but-final stats:

  dsn                   1658 messages      8654 kB
  spam                  1887 messages     16640 kB
  ham                   3840 messages     11342 kB
  virus                  988 messages    120754 kB

(dsn = "delivery status notification", aka bounces and delay notifications)

This is from harvesting for one week (2002-09-11 to 2002-09-18), with
some manual sorting and cleaning.  Also, I deleted 90% of the DSNs at
random, because they were by far the largest source of email -- so the
1658 there is just 10% of what was actually received.  No doubt test
runs will reveal more FPs and FNs, but I've done all the manual
cleaning/sorting I can stand to do for now.  (Hence "all-but-final".)
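
For anyone wanting to do the same kind of random thinning on their own
harvest, here is a minimal sketch.  It is not the script that was
actually used on this corpus; the function name thin_sample and its
parameters are hypothetical, and it only picks which items to keep
(deleting the rest is left to the caller).

```python
import random


def thin_sample(items, keep_fraction=0.1, seed=None):
    """Return a random subset holding about keep_fraction of items.

    Hypothetical helper: 'items' could be a list of Maildir keys or
    filenames; nothing is deleted here, the caller decides what to do
    with the items that were not selected.
    """
    items = list(items)
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    keep_count = round(len(items) * keep_fraction)
    # sample() picks without replacement, so no item is kept twice
    return rng.sample(items, keep_count)
```

Keeping 10% this way means the surviving count times ten approximates
what was originally received, which is how the 1658 figure above should
be read.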

Those of you with logins on mail.python.org can find the corpus in
/usr/local/var/harvest; there are four Maildir folders there.  (Hint:
look in spam/cur for the spam.  Or just run "mutt -f spam" -- mutt groks
Maildir fine.)
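
If you would rather poke at the folders from Python than from mutt, the
standard library's mailbox module reads Maildir too.  The following is
only a sketch of how per-folder counts and sizes like the ones above
could be tallied; maildir_stats is a hypothetical name, and the sizes
it reports are raw message bytes, which may not match how the figures
above were measured.

```python
import mailbox


def maildir_stats(path):
    """Return (message_count, total_kilobytes) for one Maildir folder.

    Hypothetical helper.  create=False makes a missing folder raise
    mailbox.NoSuchMailboxError instead of silently creating it.
    """
    box = mailbox.Maildir(path, factory=None, create=False)
    count = 0
    total_bytes = 0
    # iterkeys() covers messages in both new/ and cur/
    for key in box.iterkeys():
        count += 1
        total_bytes += len(box.get_bytes(key))
    return count, total_bytes // 1024
```

Called on one of the folders under /usr/local/var/harvest (for example
the spam folder), it returns a pair you can print next to the folder
name.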

The original, untouched, uncleaned harvest is in
/usr/local/var/harvest-20020911-20020918.tar.gz -- that's just under 300
MB of email.

I'd like to put some tarballs on the mail.python.org web server, but
password-protected -- I want to know who has access to this corpus.
This is not so important if I only put the spam and viruses on the web,
but the ham folder contains some semi-private stuff (e.g. a bit of
postmaster traffic, posts to your and Barry's "extracurricular
activity" lists [dc-ci, dc-bass, etc.]).  And the dsn folder reveals all
sorts of stuff about who's subscribed to python.org lists.  There also
seems to be some personal email in the ham -- not sure how that snuck
in, since I was careful to detect and *not* save it.  Haven't
investigated yet.  Anyways, at least take a look at the ham and spam
folders and lemme know what you think.  If you spot FPs or FNs let me
know.  Ditto if you want to delete anything.

Greg Ward <gward@python.net>                         http://www.gerg.ca/
All the world's a stage and most of us are desperately unrehearsed.