[Spambayes] deleting "duplicate" spam before training? good idea orbad?

Greg Ward gward@python.net
Mon, 9 Sep 2002 15:25:42 -0400


On 09 September 2002, Tim Peters said:
> > Would people be interested in the script?  I'd be happy to extricate
> > it from my local modules and check it into CVS.
> 
> Sure!  I think it's relevant, but maybe for another purpose.  Paul Svensson
> is thinking harder about real people <wink> than the rest of us, and he may
> be able to get use out of approaches that identify closely related spam.
> For example, some amount of spam is going to end up in the ham training data
> in real life use, and any sort of similarity score to a piece of known spam
> may be an aid in finding and purging it.

OTOH, look into DCC (Distributed Checksum Clearinghouse,
http://www.rhyolite.com/anti-spam/dcc/), which uses fuzzy checksums.
It's quite likely that DCC's checksumming scheme is better than
something any of us would throw together for personal use (no offense,
Skip!).  But I have no personal experience of it.

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
If it can't be expressed in figures, it is not science--it is opinion.