[Spambayes] aging information

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Mon Feb 17 10:13:26 EST 2003


2/17/2003 9:57:23 AM, "D. R. Evans" <N7DR at arrisi.com> wrote:

>On 17 Feb 2003 at 9:38, Tim Stone - Four Stones Expressions wrote:
>
>> 2/17/2003 9:30:50 AM, "D. R. Evans" <N7DR at arrisi.com> wrote:
>> 
>> >Does spambayes have any concept that "the older information is, the
>> >less value it has"?
>> 
>> There was a huge discussion about this topic toward the end of the
>> research phase of the project, maybe about october last year... At that
>
>Is this discussion easily retrievable from anywhere? 

Yes, the archive of this list is available at 
http://mail.python.org/mailman/listinfo/spambayes

>
>> guys have a better memory than me.  But I think that it revolved around
>> the idea that while the overall content and organization of spam
>> certainly will evolve, the tokens (e.g. words) that are used in spam
>> come from basically a finite set, and don't evolve in the same way that
>> combinations of tokens (spam) evolve.  Since spambayes is completely
>
>At first blush, that seems to me to fail to take into account the fact 
>that the end-user's notion of what constitutes spam might reasonably 
>change as a function of time.

Yes, this is a 'side effect'.  For example, my current training classifies 
'buy gold, beat the market' mail as spam.  But now I've become interested 
in investing in gold, and I'd really like to see those mails.  There are a 
couple of strategies for retraining your database.  One is to be sure to train 
on all "mistakes," or misclassifications.  In other words, don't simply 
ignore your spam folder.  Browse it every so often, and train on what's 
there, whether it was classified right or wrong.  As you reclassify the 
'buy gold' mail in your spam folder as ham, the database will most likely 
learn your new view of this mail rather quickly.

The other strategy is to completely retrain your database from scratch, after 
reorganizing your saved spam and ham mails to reflect your current value 
system.  This is a bit more work, but will yield immediate results.
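
Correcting a mistake, the first strategy above, can be sketched roughly like 
this in Python.  This is not the spambayes code: ToyTokenDB and its train, 
untrain and retrain methods are names made up for illustration, and the only 
assumption is that the database holds a spam count and a ham count per token.

```python
from collections import defaultdict


class ToyTokenDB:
    """Hypothetical stand-in for a token database; not the spambayes API."""

    def __init__(self):
        # token -> [times trained as spam, times trained as ham]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, tokens, is_spam):
        column = 0 if is_spam else 1
        # Count each token once per message, however often it appears.
        for token in set(tokens):
            self.counts[token][column] += 1

    def untrain(self, tokens, is_spam):
        column = 0 if is_spam else 1
        for token in set(tokens):
            self.counts[token][column] -= 1

    def retrain(self, tokens, was_spam):
        # Undo the old judgement, then record the new one.
        self.untrain(tokens, was_spam)
        self.train(tokens, not was_spam)


db = ToyTokenDB()
gold_mail = "buy gold beat the market".split()
db.train(gold_mail, is_spam=True)      # the original training
db.retrain(gold_mail, was_spam=True)   # the user now wants this mail
gold_counts = db.counts["gold"]        # spam count back to 0, ham count 1
```

Each correction moves one message's worth of counts from one column to the 
other, which is why repeated corrections shift the classification without 
any notion of message age.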

Aging is a very difficult problem, because spambayes simply keeps track of 
tokens and, for each token, the number of times you've said that mail 
containing it is spam and the number of times you've said it is ham.  That's 
all the information we retain about tokens.  We could do some aging if we 
added 'date trained' as part of the token key, but that would result in a 
database size explosion, severely degrading performance, increasing the 
system's complexity, and making the footprint unacceptably huge.  But without 
that information, a meaningful aging mechanism is not possible.  So instead 
we've enabled 'retraining' a particular mail, or set of mails, which safely 
shifts the database's learning while keeping the database manageable.  Make 
sense?
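
To put a rough number on the size explosion, here is a small Python sketch 
that counts database keys with and without the training date in the key.  
count_keys is a hypothetical helper, not part of spambayes, and the 30 days 
of identical four-token mail are made-up data chosen only to show the 
effect.

```python
from datetime import date, timedelta


def count_keys(messages, key_by_date):
    """Count distinct database keys for (date_trained, tokens) pairs."""
    keys = set()
    for day, tokens in messages:
        for token in tokens:
            # With aging, each (token, date) pair needs its own record.
            keys.add((token, day) if key_by_date else token)
    return len(keys)


start = date(2003, 1, 1)
tokens = ["buy", "gold", "beat", "market"]
# Made-up data: the same four tokens trained once a day for 30 days.
messages = [(start + timedelta(days=i), tokens) for i in range(30)]

plain_keys = count_keys(messages, key_by_date=False)  # 4 keys
dated_keys = count_keys(messages, key_by_date=True)   # 4 tokens * 30 days = 120 keys
```

With the token alone as the key, the database grows only when new tokens 
show up.  With the date in the key, it grows with every day of training, 
even when the vocabulary stays the same.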

> I can see that I'm going to have to learn 
>python and then try to understand the spambayes code so that I can try 
>to add this myself, just to see if it really is useful.

Python is a surprisingly easy language to learn, and even easier to read.  :)

>
>Time, time, time... does anyone have any for sale?

Lemme know if you find any... <wink> - TimS

>
>  Doc Evans
>--------------------------------------------------------------
>Phone:  +1 303 494 0394
>Mobile: +1 720 839 8462
>Fax:    +1 781 240 0527
>--------------------------------------------------------------
>
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org






