[Spambayes] aging information
Tim Stone - Four Stones Expressions
tim at fourstonesExpressions.com
Mon Feb 17 10:13:26 EST 2003
2/17/2003 9:57:23 AM, "D. R. Evans" <N7DR at arrisi.com> wrote:
>On 17 Feb 2003 at 9:38, Tim Stone - Four Stones Expressions wrote:
>
>> 2/17/2003 9:30:50 AM, "D. R. Evans" <N7DR at arrisi.com> wrote:
>>
>> >Does spambayes have any concept that "the older information is, the
>> >less value it has"?
>>
>> There was a huge discussion about this topic toward the end of the
>> research phase of the project, maybe about october last year... At that
>
>Is this discussion easily retrievable from anywhere?
Yes, the archive of this list is available at
http://mail.python.org/mailman/listinfo/spambayes
>
>> guys have a better memory than me. But I think that it revolved around
>> the idea that while the overall content and organization of spam
>> certainly will evolve, the tokens (e.g. words) that are used in spam
>> come from basically a finite set, and don't evolve in the same way that
>> combinations of tokens (spam) evolve. Since spambayes is completely
>
>At first blush, that seems to me to fail to take into account the fact
>that the end-user's notion of what constitutes spam might reasonably
>change as a function of time.
Yes, this is a 'side effect'. For example, my current training classifies
'buy gold, beat the market' mail as spam. But suppose I've since become
interested in investing in gold, and I'd really like to see those mails.
There are a couple of strategies for retraining your database. One is to be
sure to train on all "mistakes," or misclassifications. In other words, don't
simply ignore your spam folder. Browse it every so often, and train on what's
there, whether it was classified correctly or not. As you reclassify 'buy
gold' mail in your spam folder, the database will learn your new view of this
mail, most likely rather quickly.
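To make the "train on mistakes" idea concrete, here is a minimal sketch of
how per-token counts shift as corrections are trained. The TokenCounts class
and its spam_ratio method are hypothetical names made up for illustration;
they are not the spambayes classes, and the simple ratio is not the scoring
spambayes actually uses.

```python
from collections import defaultdict


class TokenCounts:
    """Toy per-token counter (hypothetical; not the spambayes API)."""

    def __init__(self):
        # token -> [times seen in mail trained as spam, times seen in ham]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, tokens, is_spam):
        index = 0 if is_spam else 1
        # Count each token once per message, however often it appears.
        for token in set(tokens):
            self.counts[token][index] += 1

    def spam_ratio(self, token):
        # Naive illustration only: fraction of trainings that were spam.
        spam, ham = self.counts[token]
        return spam / (spam + ham)


db = TokenCounts()
gold_mail = ["buy", "gold", "beat", "the", "market"]

# Old training: three 'buy gold' mails were marked as spam.
for _ in range(3):
    db.train(gold_mail, is_spam=True)
before = db.spam_ratio("gold")

# Correction: two new 'buy gold' mails found in the spam folder
# are trained as ham.
for _ in range(2):
    db.train(gold_mail, is_spam=False)
after = db.spam_ratio("gold")

print(before, after)
```

Each corrected mail pulls the ratio for its tokens toward ham, which is why
browsing the spam folder and training on what you find there shifts the
database without any notion of dates.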
The other strategy is to completely retrain your database from scratch, after
reorganizing your saved spam and ham mails to reflect your current value
system. This is a bit more work, but will yield immediate results.
Aging is a very difficult problem, because spambayes simply keeps track of
tokens and, for each token, the number of times you've said that mail
containing it is spam and the number of times you've said it is ham. That's
all the information we retain about tokens. We could do some aging if we
added 'date trained' as part of the token key, but that would make the
database size explode, severely degrade performance, increase the system's
complexity, and make the footprint unacceptably huge. But without that
information, a meaningful aging mechanism is not possible. So we've enabled
'retraining' a particular mail, or set of mails, which safely shifts the
database's learning while keeping the database manageable. Make sense?
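A rough sketch of why putting 'date trained' into the key blows up the
database: the same tokens, trained once a day for a month, stored under a
plain token key and under a (token, date) key. The dictionaries, the token
list and the dates are all made up for illustration; this is not how
spambayes stores its data.

```python
from datetime import date, timedelta

plain = {}   # token -> (spam_count, ham_count)
dated = {}   # (token, date trained) -> (spam_count, ham_count)

tokens = ["buy", "gold", "market"]   # made-up tokens
start = date(2003, 1, 1)             # made-up start date

# Train one spam mail containing the same tokens on each of 30 days.
for day in range(30):
    trained_on = start + timedelta(days=day)
    for token in tokens:
        spam, ham = plain.get(token, (0, 0))
        plain[token] = (spam + 1, ham)

        key = (token, trained_on)
        spam, ham = dated.get(key, (0, 0))
        dated[key] = (spam + 1, ham)

# The plain store has one entry per token; the dated store has one entry
# per token per day on which that token was trained.
print(len(plain), len(dated))
```

With a plain key the store grows only with the vocabulary, while the dated
key grows with vocabulary times the number of training days, and every
lookup for scoring would have to gather all the dated entries for a token.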
> I can see that I'm going to have to learn
>python and then try to understand the spambayes code so that I can try
>to add this myself, just to see if it really is useful.
Python is a surprisingly easy language to learn, and even easier to read. :)
>
>Time, time, time... does anyone have any for sale?
Lemme know if you find any... <wink> - TimS
>
> Doc Evans
>--------------------------------------------------------------
>Phone: +1 303 494 0394
>Mobile: +1 720 839 8462
>Fax: +1 781 240 0527
>--------------------------------------------------------------
>
>
>
c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org