[Spambayes] Moving closer to Gary's ideal

Greg Ward gward@python.net
Mon, 23 Sep 2002 10:15:19 -0400


On 23 September 2002, Guido van Rossum said:
> > One possibility is to use the spam half of the corpus I gathered last
> > week.
> 
> How many msgs by now?

Here are the all-but-final stats:

  dsn                   1658 messages      8654 kB
  spam                  1887 messages     16640 kB
  ham                   3840 messages     11342 kB
  virus                  988 messages    120754 kB

(dsn = "delivery status notification", aka bounces and delay notifications)

This is from harvesting for one week (2002-09-11 to 2002-09-18), with
some manual sorting and cleaning.  Also, I deleted 90% of the DSNs at
random, because they were by far the largest source of email -- so the
1658 there is just 10% of what was actually received.  No doubt test
runs will reveal more FPs and FNs, but I've done all the manual
cleaning/sorting I can stand to do for now.  (Hence "all-but-final"
stats.)
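
The random 90% cut of the DSNs could be sketched roughly like this.
It is only an illustration, not the script actually used: the name
`pick_survivors`, the optional seed, and the choice to keep an exact
10% (rather than dropping each message independently with probability
0.9) are all assumptions.

```python
import random

def pick_survivors(keys, keep_fraction=0.10, seed=None):
    """Return the set of message keys to keep; the rest get deleted.

    keys          -- any iterable of message identifiers (hypothetical)
    keep_fraction -- share of messages to keep (0.10 keeps 10%)
    seed          -- optional seed so the same cut can be reproduced
    """
    rng = random.Random(seed)
    keys = list(keys)
    n_keep = round(len(keys) * keep_fraction)
    # sample() draws without replacement, so no key is picked twice
    return set(rng.sample(keys, n_keep))
```

Worked backwards, 1658 survivors at 10% puts the DSNs actually
received that week at roughly 16,580.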

Those of you with logins on mail.python.org can find the corpus in
/usr/local/var/harvest; there are four Maildir folders there.  (Hint:
look in spam/cur for the spam.  Or just run "mutt -f spam" -- mutt groks
Maildir fine.)
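
For anyone who wants to re-check the per-folder numbers, here is a
minimal sketch using the `mailbox` module from the Python standard
library. The function names are made up for illustration, the folder
names other than spam are guessed from the stats table, and counting
a kB as 1024 bytes (rounded down) is an assumption, so the sizes may
not match the table exactly.

```python
import mailbox
import os

def maildir_stats(path):
    """Return (message_count, total_kB) for one Maildir folder."""
    # create=False: raise an error instead of silently making a new folder
    box = mailbox.Maildir(path, create=False)
    try:
        # get_bytes() reads one raw message; only its length is kept
        sizes = [len(box.get_bytes(key)) for key in box.keys()]
    finally:
        box.close()
    return len(sizes), sum(sizes) // 1024

def report(root, folders=("dsn", "spam", "ham", "virus")):
    """Print one line per folder (folder names are assumed)."""
    for name in folders:
        count, kb = maildir_stats(os.path.join(root, name))
        print("  %-20s %5d messages %9d kB" % (name, count, kb))
```

Something like report("/usr/local/var/harvest") would then print one
line per folder.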

The original, untouched, uncleaned harvest is in
/usr/local/var/harvest-20020911-20020918.tar.gz -- that's just under 300
MB of email.

I'd like to put some tarballs on the mail.python.org web server, but
password-protected -- I want to know who has access to this corpus.
This is not so important if I only put the spam and viruses on the web,
but the ham folder contains some semi-private stuff (e.g. a bit of
postmaster traffic, posts to yours and Barry's "extracurricular
activity" lists [dc-ci, dc-bass, etc.]).  And the dsn folder reveals all
sorts of stuff about who's subscribed to python.org lists.  There also
seems to be some personal email in the ham -- not sure how that snuck
in, since I was careful to detect and *not* save it.  Haven't
investigated yet.  Anyways, at least take a look at the ham and spam
folders and lemme know what you think.  If you spot FPs or FNs let me
know.  Ditto if you want to delete anything.

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
All the world's a stage and most of us are desperately unrehearsed.