<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content="text/html; charset=iso-8859-1" http-equiv=Content-Type>
<META content="MSHTML 5.00.3105.105" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial size=2>"Dan Maas" &lt;<A
href="mailto:dmaas@nospam.dcine.com">dmaas@nospam.dcine.com</A>&gt;
says:<BR><BR>> > I've been saving up all the spam messages I get for the
past two months.<BR>> > I have about 1869 spam messages saved.<BR>>
> Now I'd like to develop a neural net based filter for my email
program<BR>> > and train it to recognize these messages as
spam.<BR>><BR>> Cool... I assume the main thing you are worrying about is
accidentally<BR>> rejecting non-spam emails, which might happen too easily
with a<BR>> naive keyword-based system.<BR>><BR>> How about this -
apply a whole set of tests to the message. Each test<BR>> gives a "spammness"
score - e.g. 10 points for being all caps, 50 points<BR>> for having the word
'viagara', 100 points for having a suspicious From:<BR>> address like
*@yahoo.com. Add the scores from the different tests, and<BR>> if the sum
exceeds, say, 200 points, then call it "spam."<BR>><BR>> So, how do you
figure out a good value for each test score? This is where<BR>> you could use
a neural network or genetic algorithm. Pick a set of<BR>> scores, feed the
program lots of messages (both spam and non-spam), and<BR>> see how accurate
it is. Iterate until it rejects every spam email and<BR>> accepts every
non-spam...<BR>><BR>> Dan<BR>> --<BR>> <A
href="http://mail.python.org/mailman/listinfo/python-list">http://mail.python.org/mailman/listinfo/python-list</A><BR><BR>Excellent
idea, Dan. That conveniently sidesteps the most difficult<BR>issue:
getting the neural network to actually come up with linguistic<BR>rules.
Once an intelligent human specifies the set of rules, the neural<BR>net should
have no difficulty coming up with an optimal non-linear<BR>function of
pre-processed features (i.e. the "rules") to identify spam.<BR>Analysis of the
weights after training will help remove rules that turn<BR>out to be
irrelevant.<BR><BR>In other words, the input vector is simply the set of results from
your<BR>arbitrary rule set.<BR><BR>Since irrelevant rules are fairly harmless
(other than decreasing<BR>performance), one could initialize the rule set to include a
rule for every word<BR>that occurs in spam messages more often than in
non-spam<BR>messages. Then supplement it with rules like the ones you
mention.<BR><BR>Here's another idea for acquiring sample data. Send
'please send<BR>me more info' messages to everyone who has sent you spam,
from<BR>a newly created spam-recipient email address. That
address<BR>will probably be sold to everyone. BTW, make sure your
spam<BR>recipient is on an ISP that does -not- defend against
spam!<BR><BR>(Technically, it's not actually spam you'd be receiving since
you<BR>are explicitly requesting it, but close enough :-)<BR><BR>I want to be
involved in this project. Let's take this offline.<BR><BR>-
Ken<BR>----------------------------------------------------<BR>Copyright (c)
2001 by Ken Seehof<BR>This document may not be distributed,
copied,<BR>duplicated, or replicated, or duplicated in any<BR>form without
express permission by Ken Seehof.<BR>Permission is hereby granted.<BR><A
href="mailto:kseehof@neuralintegrator.com">kseehof@neuralintegrator.com</A><BR>----------------------------------------------------<BR>The
opinions expressed herein are not necessarily<BR>those of George W.
Bush.<BR><BR><BR><BR></FONT></DIV></BODY></HTML>