<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML xmlns="http://www.w3.org/TR/REC-html40" xmlns:o =
"urn:schemas-microsoft-com:office:office" xmlns:w =
"urn:schemas-microsoft-com:office:word"><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2800.1400" name=GENERATOR>
<STYLE>@page Section1 {size: 8.5in 11.0in; margin: 1.0in 1.25in 1.0in 1.25in; }
P.MsoNormal {
FONT-SIZE: 12pt; MARGIN: 0in 0in 0pt; FONT-FAMILY: "Times New Roman"
}
LI.MsoNormal {
FONT-SIZE: 12pt; MARGIN: 0in 0in 0pt; FONT-FAMILY: "Times New Roman"
}
DIV.MsoNormal {
FONT-SIZE: 12pt; MARGIN: 0in 0in 0pt; FONT-FAMILY: "Times New Roman"
}
H2 {
FONT-SIZE: 18pt; MARGIN-LEFT: 0in; COLOR: green; MARGIN-RIGHT: 0in; FONT-FAMILY: "Times New Roman"; mso-margin-top-alt: auto; mso-margin-bottom-alt: auto
}
A:link {
COLOR: blue; TEXT-DECORATION: underline
}
SPAN.MsoHyperlink {
COLOR: blue; TEXT-DECORATION: underline
}
A:visited {
COLOR: purple; TEXT-DECORATION: underline
}
SPAN.MsoHyperlinkFollowed {
COLOR: purple; TEXT-DECORATION: underline
}
CODE {
FONT-FAMILY: "Courier New"
}
PRE {
FONT-SIZE: 10pt; MARGIN: 0in 0in 0pt; FONT-FAMILY: "Courier New"
}
TT {
FONT-FAMILY: "Courier New"
}
SPAN.EmailStyle21 {
mso-style-type: personal-compose
}
DIV.Section1 {
page: Section1
}
</STYLE>
</HEAD>
<BODY lang=EN-US vLink=purple link=blue>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004>The most likely cause is the imbalance in your
training data. SpamBayes is most accurate when it is trained on approximately
the same number of ham messages as spam messages. A ratio of up to 5 to 1 or
so is probably fine, but your ratio is currently about 44 to 1 in favor of spam
(6168 spam to 140 ham), which heavily biases all your results towards
ham.</SPAN></FONT></DIV>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004></SPAN></FONT> </DIV>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004>For example, the token "christianity" appears in 10
ham messages and 7 spam messages, roughly the same number. However, the spam
probability of that token is only 0.028, because the most basic statistic that
SpamBayes works from is the percentage of messages of each type that contain
the token, not the raw count. This token appears in 10 out of 140 ham
messages for a ham percentage of 7.14%, and it appears in 7 out of 6168 spam
messages for a spam percentage of only 0.11%. The ham percentage is almost 63
times larger than the spam percentage.</SPAN></FONT></DIV>
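<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004></SPAN></FONT> </DIV>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004>As a rough sketch of the arithmetic, the Python below
reproduces the 0.028 figure from the counts in your clue report. It assumes the
default option values unknown_word_prob = 0.5 and unknown_word_strength = 0.45.
The function name token_spamprob is made up for illustration and is not part of
the SpamBayes API.</SPAN></FONT></DIV>
<PRE>
```python
# Sketch only: per-token spam probability from message counts.
# Assumes the default options unknown_word_prob = 0.5 and
# unknown_word_strength = 0.45; token_spamprob is a hypothetical helper.

def token_spamprob(ham_count, spam_count, nham, nspam,
                   unknown_word_prob=0.5, unknown_word_strength=0.45):
    # Fraction of ham (and of spam) messages that contain the token.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam

    # Raw estimate based only on the two fractions.
    prob = spam_ratio / (ham_ratio + spam_ratio)

    # Pull the estimate towards unknown_word_prob when the token has
    # been seen in only a few messages.
    n = ham_count + spam_count
    return ((unknown_word_strength * unknown_word_prob + n * prob)
            / (unknown_word_strength + n))


# 'christianity': 10 of 140 ham, 7 of 6168 spam.
print(round(token_spamprob(10, 7, 140, 6168), 4))   # 0.0281

# The same counts with balanced training (140 ham, 140 spam).
print(round(token_spamprob(10, 7, 140, 140), 2))    # 0.41
```
</PRE>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004>With the same token counts but balanced training, the
token would come out close to neutral (about 0.41) instead of strongly
hammy.</SPAN></FONT></DIV>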
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004></SPAN></FONT> </DIV>
<DIV dir=ltr align=left><FONT face=Verdana color=#0000ff size=2><SPAN
class=428053419-02072004>With an imbalance this large, your best bet is probably
to delete your training data and train again from scratch. Try starting out
without feeding SpamBayes any existing messages for initial training, and then
train only on mistakes and unsures. If you see several spam messages in your
unsure folder that look similar, try training on only one of them and deleting
the rest to avoid training on too many spams.</SPAN></FONT></DIV>
<DIV><FONT face=Verdana color=#0000ff size=2></FONT> </DIV>
<DIV align=left><FONT face=Verdana size=2>-- </FONT></DIV>
<DIV align=left><FONT face=Verdana size=2>Kenny Pitt</FONT></DIV>
<DIV><FONT face=Verdana color=#0000ff size=2></FONT> </DIV><FONT
face=Verdana size=2></FONT><BR>
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> spambayes-dev-bounces@python.org
[mailto:spambayes-dev-bounces@python.org] <B>On Behalf Of </B>G. Waleed
Kavalec<BR><B>Sent:</B> Friday, July 02, 2004 1:51 PM<BR><B>To:</B>
spambayes-dev@python.org<BR><B>Subject:</B> [spambayes-dev] Spam Clues:
&lt;&gt;&lt; STOP! Looking for anti christianchristians<BR></FONT><BR></DIV>
<DIV></DIV>
<DIV class=Section1>
<H2><B><FONT face="Times New Roman" color=blue size=2><SPAN
style="FONT-SIZE: 10pt; COLOR: blue">This thing won’t
die.<o:p></o:p></SPAN></FONT></B></H2>
<H2><B><FONT face="Times New Roman" color=blue size=2><SPAN
style="FONT-SIZE: 10pt; COLOR: blue">It doesn’t even go to
‘maybe’.<o:p></o:p></SPAN></FONT></B></H2>
<H2><B><FONT face="Times New Roman" color=blue size=2><SPAN
style="FONT-SIZE: 10pt; COLOR: blue">“What’s up with
that?”<o:p></o:p></SPAN></FONT></B></H2>
<H2><B><FONT face="Times New Roman" color=green size=5><SPAN
style="FONT-SIZE: 18pt"><o:p> </o:p></SPAN></FONT></B></H2>
<H2><B><FONT face="Times New Roman" color=green size=5><SPAN
style="FONT-SIZE: 18pt">Combined Score: 0%
(3.16545e-005)<o:p></o:p></SPAN></FONT></B></H2>
<P class=MsoNormal><FONT face="Times New Roman" size=3><SPAN
style="FONT-SIZE: 12pt">Internal ham score (</SPAN></FONT><TT><FONT
face="Courier New" size=2><SPAN style="FONT-SIZE: 10pt">*H*</SPAN></FONT></TT>):
1<BR>Internal spam score (<TT><FONT face="Courier New" size=2><SPAN
style="FONT-SIZE: 10pt">*S*</SPAN></FONT></TT>): 6.3309e-005<BR><BR># ham
trained on: 140<BR># spam trained on: 6168<o:p></o:p></P>
<H2><B><FONT face="Times New Roman" color=green size=5><SPAN
style="FONT-SIZE: 18pt">150 Significant Tokens<o:p></o:p></SPAN></FONT></B></H2><PRE><STRONG><B><FONT face="Courier New" size=2><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Courier New'">token             spamprob    #ham  #spam<o:p></o:p></SPAN></FONT></B></STRONG></PRE><PRE><FONT face="Courier New" size=2><SPAN style="FONT-SIZE: 10pt">'religions'       0.027636       9      5<o:p></o:p></SPAN></FONT></PRE><PRE><FONT face="Courier New" size=2><SPAN style="FONT-SIZE: 10pt">'christianity'    0.0281306     10      7<o:p></o:p></SPAN></FONT></PRE><PRE><FONT face="Courier New" size=2><SPAN style="FONT-SIZE: 10pt">'jesus,'          0.0281306     10      7<o:p></o:p></SPAN></FONT></PRE><PRE><FONT face="Courier New" size=2><SPAN style="FONT-SIZE: 10pt">'religion,'       0.0282139     12     10<o:p></o:p></SPAN></FONT></PRE><PRE><CODE><FONT face="Courier New" size=2><SPAN style="FONT-SIZE: 10pt"></SPAN></FONT></CODE><o:p></o:p> </PRE></DIV></BODY></HTML>