<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD><TITLE>Incremental Training</TITLE>

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

<META content="MSHTML 6.00.2900.2180" name=GENERATOR></HEAD>

<BODY>

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2>Incremental training never completely ends, but the number 

of new messages that need training drops sharply after a short 

time.&nbsp;Most people reach 95% or better accuracy after only a week or two of training on 

mistakes and unsures.&nbsp;However, spammers constantly advertise new 

scams or modify their message format to get around spam 

filters, so no filter can be 100% 

accurate.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2></FONT></SPAN>&nbsp;</DIV>

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2>Experience has shown that as long as you train only on the 

mistakes and unsures, your database size should remain reasonably 

small.&nbsp;You would likely have to train on thousands of messages before there 

would be any noticeable slow-down in the SpamBayes 

processing.</FONT></SPAN></DIV>
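<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>As a rough illustration of which messages count as mistakes and unsures, here is a minimal Python sketch. It is not SpamBayes code: the function names are made up for this example, and the cutoff values of 0.20 and 0.90 are assumptions that you should replace with whatever your own configuration uses.</FONT></DIV>

<PRE>
```python
# Hypothetical cutoffs for this sketch; substitute the values from your own setup.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def classify(score):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if score >= SPAM_CUTOFF:
        return "spam"
    if score >= HAM_CUTOFF:
        return "unsure"
    return "ham"


def needs_training(score, actual):
    """Return True when the filter's verdict differs from the user's label.

    An 'unsure' verdict never equals 'ham' or 'spam', so unsures always qualify.
    """
    return classify(score) != actual


# (score, label the user would assign) pairs, invented for illustration.
messages = [
    (0.99, "spam"),  # correctly caught: skip
    (0.01, "ham"),   # correctly passed: skip
    (0.55, "spam"),  # unsure: train
    (0.05, "spam"),  # missed spam: train
    (0.95, "ham"),   # false positive: train
]

to_train = [m for m in messages if needs_training(*m)]
print(len(to_train), "of", len(messages), "messages need training")
```
</PRE>

<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>Only three of the five sample messages would be trained here, which is why the database grows slowly under this approach.</FONT></DIV>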

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2></FONT></SPAN>&nbsp;</DIV>

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2>This leads into your last question.&nbsp;The thing you want 

to avoid is bulk-training on large numbers of messages, particularly if you are 

training only one type of message such as all spam and no or very few good 

messages. First, it unnecessarily increases the size of your training database. 

Second, it can cause you to have significantly more trained messages of one type 

than you have of the other.</FONT></SPAN></DIV>

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2></FONT></SPAN>&nbsp;</DIV>

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2>The theory behind the SpamBayes filter suggests that 

optimum performance is achieved when the number of good messages trained and the 

number of spam messages trained are about equal. Most people still see excellent 

results if they have trained 5 or even 10 spam messages for every good message. 

If your training gets more one-sided than that, there is a good chance that your 

accuracy will start to decrease. But every user is different and it seems that 

some people are still getting good results with imbalances as&nbsp;high as 100 

to 1 or more.</FONT></SPAN></DIV>
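<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>If you want to keep an eye on your own balance, the arithmetic is simple. The Python sketch below is only an illustration: the function names are hypothetical, the counts are ones you would supply yourself, and the default warning level of 10 to 1 is taken from the rule of thumb above rather than from any SpamBayes setting.</FONT></DIV>

<PRE>
```python
def training_balance(n_ham, n_spam):
    """Return the imbalance as the larger count divided by the smaller count."""
    smaller = min(n_ham, n_spam)
    if smaller == 0:
        # Nothing trained on one side: treat the imbalance as unbounded.
        return float("inf")
    return max(n_ham, n_spam) / smaller


def describe_balance(n_ham, n_spam, warn_ratio=10.0):
    """Summarize the balance; warn_ratio is a rule of thumb, not a hard limit."""
    ratio = training_balance(n_ham, n_spam)
    if ratio > warn_ratio:
        return "imbalanced ({:.1f} to 1): consider training more of the rarer type".format(ratio)
    return "ok ({:.1f} to 1)".format(ratio)


# Example counts, invented for illustration.
print(describe_balance(100, 500))
print(describe_balance(10, 2000))
```
</PRE>

<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>Treat the result as a hint only, since some people still get good results well beyond 10 to 1.</FONT></DIV>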

<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana 

color=#0000ff size=2></FONT></SPAN>&nbsp;</DIV>

<DIV align=left><FONT face=Verdana size=2>-- </FONT></DIV>

<DIV align=left><FONT face=Verdana size=2>Kenny Pitt</FONT></DIV>

<DIV><FONT face=Verdana color=#0000ff size=2></FONT>&nbsp;</DIV><BR>

<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>

<HR tabIndex=-1>

<FONT face=Tahoma size=2><B>From:</B> Winoto Janputra 

[mailto:winotoj@Dorfin.com] <BR><B>Sent:</B> Friday, September 24, 2004 8:24 

AM<BR><B>To:</B> Kenny Pitt<BR><B>Subject:</B> RE: [Spambayes] Incremental 

Training<BR></FONT><BR></DIV>

<DIV></DIV>

<DIV dir=ltr align=left><FONT face=Arial color=#0000ff 

size=2></FONT>&nbsp;</DIV>

<DIV><SPAN class=925261212-24092004></SPAN><FONT face=Arial><FONT 

color=#0000ff><FONT size=2>Hi&nbsp;Kenny,</FONT></FONT></FONT></DIV>

<DIV><FONT face=Arial><FONT color=#0000ff><FONT 

size=2></FONT></FONT></FONT>&nbsp;</DIV>

<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT 

size=2>T<SPAN class=925261212-24092004>hanks for your 

reply.</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>

<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT 

size=2><SPAN class=925261212-24092004>I have another question. I know it's 

different for everybody, but when should I stop the incremental training? I'm 

afraid that if the database gets too big, it will&nbsp;slow down 

Outlook.</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>

<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT 

size=2><SPAN class=925261212-24092004>We use ORF at the server level, but around 

10 spam messages still get through every day.</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>

<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT 

size=2><SPAN 

class=925261212-24092004></SPAN></FONT></FONT></FONT></FONT></FONT>&nbsp;</DIV>

<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT 

size=2><SPAN 

class=925261212-24092004>~~~~~~~~~~~~~~~~~~~</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>

<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT 

size=2><SPAN class=925261212-24092004><SPAN class=694384820-23092004><FONT 

face=Verdana color=#0000ff size=2>If you are using any of the incremental 

training methods above then there should be no need to manually train on the 

entire contents of your spam folder.&nbsp; In fact, doing so could potentially 

reduce the effectiveness of the SpamBayes filter (for mathematical reasons that 

I won't go into </FONT></SPAN></SPAN></FONT></FONT></FONT></FONT></FONT></DIV>

<DIV><FONT face=Arial color=#0000ff size=2><SPAN 

class=925261212-24092004>~~~~~~~~~~~~~~~~~~</SPAN></FONT></DIV>

<DIV><FONT face=Arial color=#0000ff size=2><SPAN class=925261212-24092004>Which 

one reduces the effectiveness, incremental training or a rebuild?</SPAN></FONT></DIV>

<DIV><FONT face=Arial color=#0000ff size=2><SPAN 

class=925261212-24092004></SPAN></FONT>&nbsp;</DIV>

<DIV><FONT face=Arial color=#0000ff size=2><SPAN 

class=925261212-24092004>Thanks,</SPAN></FONT></DIV>

<DIV><FONT face=Arial color=#0000ff size=2><SPAN 

class=925261212-24092004>Winoto</SPAN></FONT></DIV>

<DIV><FONT face=Arial color=#0000ff size=2></FONT>&nbsp;</DIV></BODY></HTML>