<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Incremental Training</TITLE>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2180" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2>Incremental training never completely ends, but the number
of new messages that need training drops sharply after a very short
time. Most people get 95% accuracy or better after only a week or two of
training on mistakes and unsures. However, spammers are constantly advertising
new scams or changing their message format to get around spam filters, so no
filter can be 100%
accurate.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2>Experience has shown that as long as you train only on the
mistakes and unsures, your database size should remain reasonably
small. You would likely have to train on thousands of messages before there
would be any noticeable slow-down in the SpamBayes
processing.</FONT></SPAN></DIV>
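<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2></FONT> </DIV>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>As a rough
illustration only (this is not the actual SpamBayes code), the rule "train only
on mistakes and unsures" can be sketched in Python. The function names and the
0.20 and 0.90 cutoffs are assumptions made for this example; your own thresholds
may differ.</FONT></DIV>
<PRE>
```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score between 0.0 and 1.0 to 'ham', 'unsure' or 'spam'.

    The cutoff values are illustrative assumptions, not read from any config.
    """
    if score > spam_cutoff:
        return "spam"
    if ham_cutoff > score:
        return "ham"
    return "unsure"


def needs_training(score, actual, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return True when the filter's verdict differs from the real type.

    'actual' is 'ham' or 'spam', so an 'unsure' verdict always needs
    training, and so does any outright mistake.
    """
    return classify(score, ham_cutoff, spam_cutoff) != actual
```
</PRE>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>With these
assumed cutoffs, a spam message scored 0.50 (an unsure) or a good message scored
0.95 (a mistake) would be trained, while a spam message scored 0.99 would be
left alone, which is what keeps the database small.</FONT></DIV>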
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2>This leads into your last question. The thing you want
to avoid is bulk-training on large numbers of messages, particularly if you are
training only one type of message, such as all spam with few or no good
messages. First, it unnecessarily increases the size of your training database.
Second, it can leave you with significantly more trained messages of one type
than of the other.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2>The theory behind the SpamBayes filter suggests that
optimum performance is achieved when the number of good messages trained and the
number of spam messages trained are about equal. Most people still see excellent
results if they have trained 5 or even 10 spam messages for every good message.
If your training gets more one-sided than that, there is a good chance that your
accuracy will start to decrease. But every user is different, and it seems that
some people still get good results with imbalances as high as 100
to 1 or more.</FONT></SPAN></DIV>
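<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2></FONT> </DIV>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>To make the
arithmetic concrete, here is a small Python sketch for checking your own
balance from the two trained-message counts. The function names are made up for
this example, and the default limit of 10 is simply the 10-to-1 figure mentioned
above, not a SpamBayes setting. The check is symmetric, so it also flags
training that is lopsided toward good messages.</FONT></DIV>
<PRE>
```python
def imbalance(nspam, nham):
    """Return how many times larger the bigger class is than the smaller."""
    larger, smaller = max(nspam, nham), min(nspam, nham)
    if smaller == 0:
        # Nothing trained at all counts as balanced; one empty class does not.
        return float("inf") if larger else 1.0
    return larger / smaller


def describe_balance(nspam, nham, comfortable=10):
    """Label the training balance; 'comfortable' is an assumed rule of thumb."""
    if imbalance(nspam, nham) > comfortable:
        return "imbalanced"
    return "ok"
```
</PRE>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2>For example,
500 spam and 100 good messages is a ratio of 5 to 1 and is reported as "ok",
while 1100 spam and 100 good messages is 11 to 1 and is reported as
"imbalanced". Since every user is different, treat the result as a hint
rather than a verdict.</FONT></DIV>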
<DIV dir=ltr align=left><SPAN class=773302913-24092004><FONT face=Verdana
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV align=left><FONT face=Verdana size=2>-- </FONT></DIV>
<DIV align=left><FONT face=Verdana size=2>Kenny Pitt</FONT></DIV>
<DIV><FONT face=Verdana color=#0000ff size=2></FONT> </DIV><FONT
face=Verdana size=2></FONT><FONT face=Verdana size=2></FONT><BR>
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Winoto Janputra
[mailto:winotoj@Dorfin.com] <BR><B>Sent:</B> Friday, September 24, 2004 8:24
AM<BR><B>To:</B> Kenny Pitt<BR><B>Subject:</B> RE: [Spambayes] Incremental
Training<BR></FONT><BR></DIV>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff
size=2></FONT> </DIV>
<DIV><SPAN class=925261212-24092004></SPAN><FONT face=Arial><FONT
color=#0000ff><FONT size=2>Hi Kenny,</FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT color=#0000ff><FONT
size=2></FONT></FONT></FONT> </DIV>
<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT
size=2>T<SPAN class=925261212-24092004>hanks for your
reply.</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>
<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT
size=2><SPAN class=925261212-24092004>I have another question. I know it's
different for everybody, but when should I stop the incremental training? I'm
afraid that if the database gets too big it will slow down
Outlook.</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>
<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT
size=2><SPAN class=925261212-24092004>We use ORF at the server level, but he
still gets around 10 spam messages every day.</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>
<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT
size=2><SPAN
class=925261212-24092004></SPAN></FONT></FONT></FONT></FONT></FONT> </DIV>
<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT
size=2><SPAN
class=925261212-24092004>~~~~~~~~~~~~~~~~~~~</SPAN></FONT></FONT></FONT></FONT></FONT></DIV>
<DIV><FONT size=+0><FONT size=+0><FONT face=Arial><FONT color=#0000ff><FONT
size=2><SPAN class=925261212-24092004><SPAN class=694384820-23092004><FONT
face=Verdana color=#0000ff size=2>If you are using any of the incremental
training methods above then there should be no need to manually train on the
entire contents of your spam folder. In fact, doing so could potentially
reduce the effectiveness of the SpamBayes filter (for mathematical reasons that
I won't go into </FONT></SPAN></SPAN></FONT></FONT></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN
class=925261212-24092004>~~~~~~~~~~~~~~~~~~</SPAN></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN class=925261212-24092004>Which
one reduces the effectiveness, incremental training or a rebuild?</SPAN></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN
class=925261212-24092004></SPAN></FONT> </DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN
class=925261212-24092004>Thanks,</SPAN></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN
class=925261212-24092004>Winoto</SPAN></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2></FONT> </DIV></BODY></HTML>