[Spambayes] test sets?
Fri, 6 Sep 2002 09:44:17 -0400
On 05 September 2002, Tim Peters said:
> Greg Ward is
> currently capturing a stream coming into python.org, and I hope we can get a
> more modern, and cleaner, test set out of that.
Not yet -- still working on the required config changes. But I have a
> But if that stream contains
> any private email, it may not be ethically possible to make that available.
It will! Part of my cunning plan involves something like this:
    if folder == "accepted":                # i.e. not suspected junk mail
        if (len(recipients) == 1 and
            recipients[0] in ("email@example.com", "firstname.lastname@example.org", ...)):
            folder = "personal"
If you (and Guido, Barry, et al.) prefer, I could change that last
statement to "folder = None", so the mail won't be saved at all. I
*might* also add an "and sender doesn't look like -bounce-*, -request,
-admin, ..." clause to that if statement.
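
For illustration only, here is a rough sketch of how that rule might look
as a self-contained function with both optional tweaks folded in. The
names choose_folder, PERSONAL_ADDRESSES, AUTOMATED_SENDER and
save_personal are hypothetical, the address set just reuses the two
placeholder addresses from the snippet above, and the regular expression
is only one guess at what "looks like -bounce-*, -request, -admin" could
mean.

```python
import re

# Placeholder addresses taken from the snippet above; the real list is
# elided in the message and is not reproduced here.
PERSONAL_ADDRESSES = frozenset({
    "email@example.com",
    "firstname.lastname@example.org",
})

# Assumed reading of "-bounce-*, -request, -admin": the sender's local
# part ends in one of these list-manager style suffixes.
AUTOMATED_SENDER = re.compile(r"-(bounce-[^@]*|request|admin)@", re.IGNORECASE)


def choose_folder(folder, recipients, sender, save_personal=True):
    """Return the folder to file a message in, or None to skip saving it."""
    if folder != "accepted":
        # Suspected junk mail keeps whatever folder it already has.
        return folder
    if len(recipients) != 1:
        return folder
    if recipients[0].lower() not in PERSONAL_ADDRESSES:
        return folder
    if AUTOMATED_SENDER.search(sender):
        # Looks like list-manager traffic rather than private mail.
        return folder
    # Either divert private mail to its own folder or drop it entirely.
    return "personal" if save_personal else None
```

Calling choose_folder("accepted", ["email@example.com"], "alice@example.net")
would give "personal"; passing save_personal=False gives None, which
corresponds to the "folder = None" variant.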
> Can you think of anyplace to get a large, shareable ham sample apart from a
> public mailing list? Everyone's eager to share their spam, but spam is so
> much alike in so many ways that's the easy half of the data collection
I believe the SpamAssassin maintainers have a scheme whereby the corpus
of non-spam is distributed, i.e. several people have bodies of non-spam
that they use for collectively evolving the SA score set. If that
sounds vague, it matches my level of understanding.
Greg Ward <email@example.com> http://www.gerg.ca/
Reality is for people who can't handle science fiction.