<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">


<HTML><HEAD>


<META content="text/html; charset=iso-8859-1" http-equiv=Content-Type>


<META content="MSHTML 5.00.2920.0" name=GENERATOR>




</HEAD>


<BODY bgColor=#ffffff>


<DIV><FONT face=Arial size=2>Hello,</FONT></DIV>


<DIV> </DIV>


<DIV><FONT face=Arial size=2>Nice discussion !!!!!!</FONT></DIV>


<DIV> </DIV>


<DIV><FONT face=Arial size=2>Could the algorithm that you describe use this: <A 
href="http://www.awaretek.com/python.html">http://www.awaretek.com/python.html</A>?</FONT></DIV>


<DIV><FONT face=Arial size=2>If you take a look at this, let me know, and please 
involve me!</FONT></DIV>
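<DIV><FONT face=Arial size=2>As a rough illustration of the scoring scheme Dan describes in the quoted message below, here is a minimal Python sketch. The rule functions, the message dictionary layout (sender, body) and the names RULES, THRESHOLD, spam_score and is_spam are all made up for the example; only the point values (10, 50, 100) and the 200-point cutoff come from his post.</FONT></DIV>

<PRE>
```python
# Hypothetical rule functions: each takes a message dict with
# "sender" and "body" keys (an assumed layout) and returns True or False.
def is_all_caps(message):
    letters = [c for c in message["body"] if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def mentions_keyword(message):
    # Keyword spelled as in the quoted post.
    return "viagara" in message["body"].lower()


def suspicious_sender(message):
    return message["sender"].lower().endswith("@yahoo.com")


# (rule, points) pairs; the point values follow the examples in the post.
RULES = [
    (is_all_caps, 10),
    (mentions_keyword, 50),
    (suspicious_sender, 100),
]
THRESHOLD = 200


def spam_score(message, rules=RULES):
    """Add up the points of every rule that fires on the message."""
    return sum(points for test, points in rules if test(message))


def is_spam(message, rules=RULES, threshold=THRESHOLD):
    """Call the message spam when its total score exceeds the threshold."""
    return spam_score(message, rules) > threshold
```
</PRE>

<DIV><FONT face=Arial size=2>With only these three example rules the total can never exceed 200, so a real rule set would need many more tests, or a lower threshold passed to is_spam.</FONT></DIV>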


<DIV><FONT face=Arial size=2></FONT> </DIV>
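<DIV><FONT face=Arial size=2>The weight-tuning step from the quoted discussion could be sketched as follows. This is not a full neural network or a genetic algorithm, just a single-layer perceptron swapped in as the simplest stand-in, and the training data is invented: each row holds the 0/1 results of three hypothetical rules (all caps, keyword, suspicious sender) for one message.</FONT></DIV>

<PRE>
```python
def train_perceptron(features, labels, epochs=20, rate=1.0):
    """Fit one weight per rule plus a bias with the perceptron update rule.

    features: list of 0/1 lists, one entry per rule.
    labels:   list of 1 (spam) or 0 (non-spam), same length as features.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in zip(features, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if activation > 0 else 0
            delta = label - predicted
            if delta != 0:
                # Nudge the weights toward the correct answer.
                errors += 1
                bias += rate * delta
                weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
        if errors == 0:
            break  # every training message is classified correctly
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (spam) or 0 (non-spam) for one vector of rule results."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Invented toy data: columns are all_caps, keyword, suspicious_sender.
X = [[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 0, 0]]
y = [1, 1, 0, 0]
weights, bias = train_perceptron(X, y)
```
</PRE>

<DIV><FONT face=Arial size=2>On this toy data the all-caps rule ends up with a weight of 0, which is the kind of thing Ken's suggestion of inspecting the weights after training would reveal.</FONT></DIV>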


<DIV><FONT face=Arial size=2>Seb<BR>  </FONT></DIV>


<BLOCKQUOTE 


style="BORDER-LEFT: #000000 2px solid; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px; PADDING-LEFT: 5px; PADDING-RIGHT: 0px">


  <DIV>"Ken Seehof" &lt;<A 
  href="mailto:kens@sightreader.com">kens@sightreader.com</A>&gt; wrote in 


  message <A 


  href="news:mailman.987656068.4191.python-list@python.org">news:mailman.987656068.4191.python-list@python.org</A>...</DIV>


  <DIV><FONT face=Arial size=2>"Dan Maas" &lt;<A 
  href="mailto:dmaas@nospam.dcine.com">dmaas@nospam.dcine.com</A>&gt; 


  says:<BR><BR>> > I've been saving up all the spam messages I get for the 


  past two months.<BR>> > I have about 1869 spam messages saved.<BR>> 


  > Now I'd like to develop a neural net based filter for my email 


  program<BR>> > and train it to recognize these messages as 


  spam.<BR>><BR>> Cool... I assume the main thing you are worrying about 


  is accidentally<BR>> rejecting non-spam emails, which might happen too 


  easily with a<BR>> naive keyword-based system.<BR>><BR>> How about 


  this - apply a whole set of tests to the message. Each test<BR>> gives a 


  "spammness" score - e.g. 10 points for being all caps, 50 points<BR>> for 


  having the word 'viagara', 100 points for having a suspicious From:<BR>> 


  address like *@yahoo.com. Add the scores from the different tests, and<BR>> 


  if the sum exceeds, say, 200 points, then call it "spam."<BR>><BR>> So, 


  how do you figure out a good value for each test score? This is where<BR>> 


  you could use a neural network or genetic algorithm. Pick a set of<BR>> 


  scores, feed the program lots of messages (both spam and non-spam), 


  and<BR>> see how accurate it is. Iterate until it rejects every spam email 


  and<BR>> accepts every non-spam...<BR>><BR>> Dan<BR>> --<BR>> 


  <A 


  href="http://mail.python.org/mailman/listinfo/python-list">http://mail.python.org/mailman/listinfo/python-list</A><BR><BR>Excellent 


  idea, Dan.  That conveniently sidesteps the most difficult<BR>issue: 


  getting the neural network to actually come up with linguistic<BR>rules.  


  Once an intelligent human specifies the set of rules, the neural<BR>net should 


  have no difficulty coming up with an optimal non-linear<BR>function of 


  pre-processed features (i.e. the "rules") to identify spam.<BR>Analysis of the 


  weights after training will help remove rules that turn<BR>out to be 


  irrelevant.<BR><BR>In other words, the input vector is simply the results from 


  your<BR>arbitrary rule set.<BR><BR>Since irrelevant rules are fairly harmless 


  (other than decreasing<BR>performance), one could initialize it to include a 


  rule for every word<BR>that occurs in spam messages more often than in 


  non-spam<BR>messages.  Then supplement it with rules like the ones you 


  mention.<BR><BR>Here's another idea for acquiring sample data.  Send 


  'please send<BR>me more info' messages to everyone who has sent you spam, 


  with<BR>your newly created spam recipient email address.  Your 


  address<BR>will probably be sold to everyone.  BTW, make sure your 


  spam<BR>recipient is on an ISP that does -not- defend against 


  spam!<BR><BR>(Technically, it's not actually spam you'd be receiving since 


  you<BR>are explicitly requesting it, but close enough :-)<BR><BR>I want to be 


  involved in this project.  Let's take this offline.<BR><BR>- 


  Ken<BR>----------------------------------------------------<BR>Copyright (c) 


  2001 by Ken Seehof<BR>This document may not be distributed, 


  copied,<BR>duplicated, or replicated, or duplicated in any<BR>form without 


  express permission by Ken Seehof.<BR>Permission is hereby granted.<BR><A 


  href="mailto:kseehof@neuralintegrator.com">kseehof@neuralintegrator.com</A><BR>----------------------------------------------------<BR>The 


  opinions expressed herein are not necessarily<BR>those of George W. 


  Bush.<BR><BR><BR><BR></FONT></DIV></BLOCKQUOTE></BODY></HTML>