[Spambayes-checkins] website background.ht,1.9,1.10 docs.ht,1.6,1.7

Thu Jan 16 20:14:40 EST 2003

Update of /cvsroot/spambayes/website
In directory sc8-pr-cvs1:/tmp/cvs-serv23499/website

Modified Files:
	background.ht docs.ht 
Log Message:
Added some words.


Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** background.ht	15 Jan 2003 03:37:02 -0000	1.9
--- background.ht	17 Jan 2003 04:14:38 -0000	1.10
***************
*** 32,35 ****
--- 32,47 ----
  to make little or no difference.</p>
  
+ <p>Because the original tests of the system mixed a ham corpus from
+ a high-volume mailing list with a spam corpus from a different source,
+ email header lines were ignored completely at first (they contained too
+ many consistent clues about which source a message came from).  As a
+ result, this project tried much harder than most to find ways to extract
+ useful information from message bodies.  For example, special
+ tokenizing of embedded URLs was one of the first things tried, and
+ instantly cut the false negative rate in half.  In the end, testing
+ showed that very good classifiers can be gotten by looking only at
+ message bodies, or by looking only at message headers.  Looking at both
+ does best, of course.</p>
+ 
  <!-- correlation of clues, ... -->
  
***************
*** 112,115 ****
--- 124,139 ----
  unsure messages vs possible false positives or negatives. In the chi-squared
  results, the "unsure" window can be quite large, and still result in very small numbers of "unsure" messages. </P>
+ 
+ <p>A remarkable property of chi-combining is that people have generally
+ been sympathetic to its "Unsure" ratings:  people usually agree that
+ messages classed Unsure really are hard to categorize.  For example,
+ commercial HTML email from a company you do business with is quite likely
+ to score as Unsure the first time the system sees such a message from
+ a particular company.  Spam and commercial email both use the language
+ and devices of advertising heavily, so it's hard to tell them apart.
+ Training quickly teaches the system all sorts of things about the
+ commercial email you want, though, ranging from which company sent it
+ and how they addressed you, to the kinds of products and services it's
+ offering.</p>
  
  <h3>Training</h3>

Index: docs.ht
===================================================================
RCS file: /cvsroot/spambayes/website/docs.ht,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** docs.ht	16 Jan 2003 01:54:49 -0000	1.6
--- docs.ht	17 Jan 2003 04:14:38 -0000	1.7
***************
*** 41,45 ****
  document or corpus. (plural is hapax legomena)
  <dt>training<dd>The process of feeding spambayes some sample spam and ham messages, to teach it what to look for.
! <dt>bayesian<dd>A form of statistical analysis used (in a form) in Paul Graham's
  initial "Plan for Spam" approach. Now used as a kind of catch-all term for this class of filters, no doubt horrifying statisticians everywhere.
  </dl>
--- 41,45 ----
  document or corpus. (plural is hapax legomena)
  <dt>training<dd>The process of feeding spambayes some sample spam and ham messages, to teach it what to look for.
! <dt>Bayesian<dd>A form of statistical analysis used (in a form) in Paul Graham's
  initial "Plan for Spam" approach. Now used as a kind of catch-all term for this class of filters, no doubt horrifying statisticians everywhere.
  </dl>