[Spambayes] Moving closer to Gary's ideal
Greg Ward
gward@python.net
Mon, 23 Sep 2002 10:15:19 -0400
On 23 September 2002, Guido van Rossum said:
> > One possibility is to use the spam half of the corpus I gathered last
> > week.
>
> How many msgs by now?

Here are the all-but-final stats:
  dsn       1658 messages     8654 kB
  spam      1887 messages    16640 kB
  ham       3840 messages    11342 kB
  virus      988 messages   120754 kB
(dsn = "delivery status notification", aka bounces and delay notifications)
This is from harvesting for one week (2002-09-11 to 2002-09-18), with
some manual sorting and cleaning. Also, I deleted 90% of the DSNs at
random, because they were by far the largest source of email -- so the
1658 there is just 10% of what was actually received. No doubt test
runs will reveal more FPs and FNs, but I've done all the manual
cleaning/sorting I can stand to do for now. (Hence "all-but-final"
stats.)
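
For anyone who wants to thin a folder the same way, here is a rough sketch of random deletion that keeps about 10% of the messages. It is an illustration only, not the script actually used for the harvest; the function name thin_folder, the keep_fraction parameter and the seed argument are all made up for this example. It expects a single directory of message files, such as a Maildir's cur/ subdirectory.

```python
import os
import random


def thin_folder(folder, keep_fraction=0.10, seed=None):
    """Delete message files at random, keeping roughly keep_fraction of them.

    Hypothetical helper: 'folder' is one directory of message files
    (for example the cur/ subdirectory of a Maildir).
    Returns two lists of file names: (kept, deleted).
    """
    rng = random.Random(seed)  # pass a seed to make the run repeatable
    kept, deleted = [], []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        if rng.random() < keep_fraction:
            kept.append(name)
        else:
            os.remove(path)
            deleted.append(name)
    return kept, deleted
```

Each message is kept independently with probability keep_fraction, so the surviving count is only approximately 10% of the original, not exactly 10%.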
Those of you with logins on mail.python.org can find the corpus in
/usr/local/var/harvest; there are four Maildir folders there. (Hint:
look in spam/cur for the spam. Or just run "mutt -f spam" -- mutt groks
Maildir fine.)
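
If you'd rather check the counts from Python than from mutt, something like the following should produce per-folder numbers comparable to the stats above. It is a sketch using the standard mailbox module; the helper name folder_stats is invented here, and the sizes are taken from the files in new/ and cur/, which may not match exactly how the figures above were computed.

```python
import mailbox
import os


def folder_stats(path):
    """Return (message_count, total_kB) for the Maildir at 'path'.

    Hypothetical helper: counts messages via mailbox.Maildir and sums
    the on-disk sizes of the files in the new/ and cur/ subdirectories.
    """
    box = mailbox.Maildir(path, factory=None, create=False)
    count = len(box)
    total_bytes = 0
    for sub in ("new", "cur"):
        subdir = os.path.join(path, sub)
        for name in os.listdir(subdir):
            total_bytes += os.path.getsize(os.path.join(subdir, name))
    return count, total_bytes // 1024
```

Calling folder_stats("/usr/local/var/harvest/spam") on mail.python.org would then report the spam folder; the other three folder names are assumed to follow the labels in the stats (dsn, ham, virus).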
The original, untouched, uncleaned harvest is in
/usr/local/var/harvest-20020911-20020918.tar.gz -- that's just under 300
MB of email.
I'd like to put some tarballs on the mail.python.org web server, but
password-protected -- I want to know who has access to this corpus.
This is not so important if I only put the spam and viruses on the web,
but the ham folder contains some semi-private stuff (e.g. a bit of
postmaster traffic, posts to yours and Barry's "extracurricular
activity" lists [dc-ci, dc-bass, etc.]). And the dsn folder reveals all
sorts of stuff about who's subscribed to python.org lists. There also
seems to be some personal email in the ham -- not sure how that snuck
in, since I was careful to detect and *not* save it. Haven't
investigated yet. Anyways, at least take a look at the ham and spam
folders and lemme know what you think. If you spot FPs or FNs let me
know. Ditto if you want to delete anything.
Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
All the world's a stage and most of us are desperately unrehearsed.