[Spambayes-checkins] website background.ht,1.9,1.10 docs.ht,1.6,1.7
Tim Peters
tim_one at users.sourceforge.net
Thu Jan 16 20:14:40 EST 2003
Update of /cvsroot/spambayes/website
In directory sc8-pr-cvs1:/tmp/cvs-serv23499/website
Modified Files:
background.ht docs.ht
Log Message:
Added some words.
Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** background.ht 15 Jan 2003 03:37:02 -0000 1.9
--- background.ht 17 Jan 2003 04:14:38 -0000 1.10
***************
*** 32,35 ****
--- 32,47 ----
to make little or no difference.</p>
+ <p>Because the original tests of the system mixed a ham corpus from
+ a high-volume mailing list with a spam corpus from a different source,
+ email header lines were ignored completely at first (they contained too
+ many consistent clues about which source a message came from). As a
+ result, this project tried much harder than most to find ways to extract
+ useful information from message bodies. For example, special
+ tokenizing of embedded URLs was one of the first things tried, and
+ instantly cut the false negative rate in half. In the end, testing
+ showed that very good classifiers can be gotten by looking only at
+ message bodies, or by looking only at message headers. Looking at both
+ does best, of course.</p>
+
<!-- correlation of clues, ... -->
***************
*** 112,115 ****
--- 124,139 ----
unsure messages vs possible false positives or negatives. In the chi-squared
results, the "unsure" window can be quite large, and still result in very small numbers of "unsure" messages. </P>
+
+ <p>A remarkable property of chi-combining is that people have generally
+ been sympathetic to its "Unsure" ratings: people usually agree that
+ messages classed Unsure really are hard to categorize. For example,
+ commercial HTML email from a company you do business with is quite likely
+ to score as Unsure the first time the system sees such a message from
+ a particular company. Spam and commercial email both use the language
+ and devices of advertising heavily, so it's hard to tell them apart.
+ Training quickly teaches the system all sorts of things about the
+ commercial email you want, though, ranging from which company sent it
+ and how they addressed you, to the kinds of products and services it's
+ offering.</p>
<h3>Training</h3>
Index: docs.ht
===================================================================
RCS file: /cvsroot/spambayes/website/docs.ht,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** docs.ht 16 Jan 2003 01:54:49 -0000 1.6
--- docs.ht 17 Jan 2003 04:14:38 -0000 1.7
***************
*** 41,45 ****
document or corpus. (plural is hapax legomena)
<dt>training<dd>The process of feeding spambayes some sample spam and ham messages, to teach it what to look for.
! <dt>bayesian<dd>A form of statistical analysis used (in a form) in Paul Graham's
initial "Plan for Spam" approach. Now used as a kind of catch-all term for this class of filters, no doubt horrorifying statisticians everywhere.
</dl>
--- 41,45 ----
document or corpus. (plural is hapax legomena)
<dt>training<dd>The process of feeding spambayes some sample spam and ham messages, to teach it what to look for.
! <dt>Bayesian<dd>A form of statistical analysis used (in a form) in Paul Graham's
initial "Plan for Spam" approach. Now used as a kind of catch-all term for this class of filters, no doubt horrorifying statisticians everywhere.
</dl>