[spambayes-dev] Anybody still have a test ham/spam database?

Tue Jul 10 22:57:24 EDT 2018

[Skip Montanaro]

> Sure, but constructing a suitable ham/spam corpus
> from scratch is a non-trivial task, as you no doubt
> remember.

Ah - but we had a much subtler task then:  trying to construct a classifier
that was _useful_.  Your current task is much clearer:

> ... I am looking to insure that a Py3 port of SpamBayes
> works the same as the Py2 code.

For _that_ purpose, you can take any pile of email at all, split it into
"ham" and "spam" at random, and "just" ensure you get the same results from
the older and newer code.  Your criterion for success isn't "closeness to
human value judgment", but "same output".
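A minimal sketch of that comparison, with loud assumptions: the helper names
(assign_label, split_corpus, diff_scores) are made up for illustration and
are not part of SpamBayes, and the scores are assumed to arrive as plain
name -> float dicts produced by whatever driver you run under each Python.
The "random" split is derived from a digest of the message bytes rather than
from the random module, since the same seed isn't guaranteed to give the same
shuffle under Py2 and Py3.

```python
import hashlib


def assign_label(raw_message):
    """Pseudo-randomly label a message from a digest of its raw bytes."""
    digest = hashlib.sha256(raw_message).digest()
    # bytearray indexing yields an int on both Python 2 and Python 3.
    return "spam" if bytearray(digest)[0] & 1 else "ham"


def split_corpus(messages):
    """messages: dict of name -> raw bytes. Returns (ham_names, spam_names)."""
    ham, spam = [], []
    for name in sorted(messages):
        if assign_label(messages[name]) == "spam":
            spam.append(name)
        else:
            ham.append(name)
    return ham, spam


def diff_scores(old_scores, new_scores, tolerance=0.0):
    """Names missing from either run, or whose scores differ by more than tolerance."""
    names = set(old_scores) | set(new_scores)
    return sorted(
        n for n in names
        if n not in old_scores
        or n not in new_scores
        or abs(old_scores[n] - new_scores[n]) > tolerance
    )
```

Train and score with the same split under both interpreters, dump the scores,
and an empty list from diff_scores is the "same output" criterion.  A small
nonzero tolerance may be reasonable if the float formatting differs between
the two runs.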

For that purpose, you could synthesize gibberish email from random header &
sentence generators.  Although it would be easier to use real email ;-)
The point is that you don't have to worry at all about whether this or that
is "really ham" or "really spam" or "really unsure" - it was making those
value judgments that consumed lots of human time when building the old
curated data sets.
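
If you did want to go the gibberish route, something like the following
would do.  It's only a sketch: the word list, the example.com addresses and
the gibberish_message name are invented here, and it uses EmailMessage, so
it's Py3-only.  Generate the messages once, save them to disk, and feed the
same files to both the old and new code.

```python
import random
from email.message import EmailMessage

# Arbitrary vocabulary, made up for illustration.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot",
         "golf", "hotel", "india", "juliet", "kilo", "lima"]


def gibberish_message(rng, n_sentences=3):
    """Build a nonsense email from a random.Random instance."""
    def sentence():
        count = rng.randint(4, 10)
        words = [rng.choice(WORDS) for _ in range(count)]
        return " ".join(words).capitalize() + "."

    msg = EmailMessage()
    msg["From"] = "%s@example.com" % rng.choice(WORDS)
    msg["To"] = "%s@example.com" % rng.choice(WORDS)
    msg["Subject"] = sentence()
    msg.set_content("\n".join(sentence() for _ in range(n_sentences)))
    return msg


if __name__ == "__main__":
    rng = random.Random(12345)  # fixed seed so reruns produce the same corpus
    for i in range(3):
        print(gibberish_message(rng).as_string())
        print("-" * 40)
```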