Any Neural Net code in Python? I want to filter out spam email
Sébastien Libert
sebastien.libert at comexis.com
Thu Apr 19 03:51:13 EDT 2001
Hello,
Nice discussion!
Could the 'algorithm' that you describe use this: http://www.awaretek.com/python.html ?
If you take a look at it, let me know, and please involve me!
Seb
"Ken Seehof" <kens at sightreader.com> wrote in message news:mailman.987656068.4191.python-list at python.org...
"Dan Maas" <dmaas at nospam.dcine.com> says:
> > I've been saving up all the spam messages I get for the past two months.
> > I have about 1869 spam messages saved.
> > Now I'd like to develop a neural net based filter for my email program
> > and train it to recognize these messages as spam.
>
> Cool... I assume the main thing you are worrying about is accidentally
> rejecting non-spam emails, which might happen too easily with a
> naive keyword-based system.
>
> How about this - apply a whole set of tests to the message. Each test
> gives a "spamminess" score - e.g. 10 points for being all caps, 50 points
> for having the word 'viagara', 100 points for having a suspicious From:
> address like *@yahoo.com. Add the scores from the different tests, and
> if the sum exceeds, say, 200 points, then call it "spam."
>
> So, how do you figure out a good value for each test score? This is where
> you could use a neural network or genetic algorithm. Pick a set of
> scores, feed the program lots of messages (both spam and non-spam), and
> see how accurate it is. Iterate until it rejects every spam email and
> accepts every non-spam...
>
> Dan
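Dan's additive scheme quoted above could be sketched in Python roughly as follows. The point values (10, 50, 100) and the 200-point threshold come from his message; the function names, the dict layout of a message ("from" and "body" keys), and the exact way each test is implemented are assumptions made only for illustration.

```python
# Hypothetical sketch of the additive scoring scheme described above.
# A message is assumed to be a dict with "from" and "body" keys.

def is_all_caps(msg):
    # True when the body has letters and every letter is upper case.
    letters = [c for c in msg["body"] if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def mentions_viagara(msg):
    # Keyword test, using the spelling from the original message.
    return "viagara" in msg["body"].lower()

def suspicious_sender(msg):
    # Stand-in for "a suspicious From: address like *@yahoo.com".
    return msg["from"].lower().endswith("@yahoo.com")

# (test, points) pairs; the points are the example values from the post.
RULES = [
    (is_all_caps, 10),
    (mentions_viagara, 50),
    (suspicious_sender, 100),
]
THRESHOLD = 200

def spam_score(msg, rules=RULES):
    # Add up the points of every test that fires.
    return sum(points for test, points in rules if test(msg))

def is_spam(msg, rules=RULES, threshold=THRESHOLD):
    # Call it spam only when the sum exceeds the threshold.
    return spam_score(msg, rules) > threshold
```

With only these three example rules the highest possible score is 160, so nothing would ever cross the 200-point threshold; a usable rule set would need more tests, or different point values, which is exactly the tuning problem discussed next.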
Excellent idea, Dan. That conveniently sidesteps the most difficult
issue: getting the neural network to actually come up with linguistic
rules. Once an intelligent human specifies the set of rules, the neural
net should have no difficulty coming up with an optimal non-linear
function of pre-processed features (i.e. the "rules") to identify spam.
Analysis of the weights after training will help remove rules that turn
out to be irrelevant.
In other words, the input vector is simply the results from your
arbitrary rule set.
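As a rough illustration of training on rule outputs, here is a minimal sketch of a single-unit perceptron that learns one weight per rule. Note that this is a linear function, the simplest possible case, and not the non-linear network described above; the function names, the 0/1 feature encoding, and the learning parameters are all assumptions.

```python
# Hypothetical sketch: learn one weight per rule with a perceptron.
# Each sample is (features, label), where features is a list of 0/1
# rule results and label is 1 for spam, 0 for non-spam.

def predict(weights, bias, features):
    # Weighted sum of the rule results, thresholded at zero.
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train(samples, epochs=100, rate=1.0):
    n = len(samples[0][0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            error = label - predict(weights, bias, features)
            if error:
                # Nudge the weights of the rules that fired toward the label.
                mistakes += 1
                bias += rate * error
                for i, x in enumerate(features):
                    weights[i] += rate * error * x
        if mistakes == 0:
            # Every training message is classified correctly; stop early.
            break
    return weights, bias
```

For example, `train([([1, 0], 1), ([0, 1], 0)])` returns a weight list and a bias that can be passed to `predict`. Training only reaches zero mistakes when the samples are linearly separable; otherwise it simply stops after `epochs` passes.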
Since irrelevant rules are fairly harmless (other than decreasing
performance), one could initialize it to include a rule for every word
that occurs in spam messages more often than in non-spam
messages. Then supplement it with rules like the ones you mention.
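One way to build that initial word list is sketched below. It reads "occurs more often" as "appears in a larger fraction of messages", which is an assumption, as are the function names and the very simple tokenizer.

```python
import re

def words(text):
    # Lower-cased set of words in one message (crude tokenizer, assumed).
    return set(re.findall(r"[a-z']+", text.lower()))

def message_fractions(texts):
    # Map each word to the fraction of messages that contain it.
    counts = {}
    for text in texts:
        for w in words(text):
            counts[w] = counts.get(w, 0) + 1
    total = max(len(texts), 1)
    return {w: c / total for w, c in counts.items()}

def spam_leaning_words(spam_texts, ham_texts):
    # Words found in a larger fraction of spam than of non-spam messages.
    spam_frac = message_fractions(spam_texts)
    ham_frac = message_fractions(ham_texts)
    return sorted(w for w, f in spam_frac.items() if f > ham_frac.get(w, 0.0))

def make_word_rule(word):
    # Turn a word into a 0/1 rule usable as one element of the input vector.
    return lambda text: 1 if word in words(text) else 0
```

Each rule returned by `make_word_rule` gives one 0/1 input, so a list of such rules applied to a message yields the word part of the input vector; hand-written rules can be appended to the same list.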
Here's another idea for acquiring sample data. Send 'please send
me more info' messages to everyone who has sent you spam, with
your newly created spam recipient email address. Your address
will probably be sold to everyone. BTW, make sure your spam
recipient is on an ISP that does -not- defend against spam!
(Technically, it's not actually spam you'd be receiving since you
are explicitly requesting it, but close enough :-)
I want to be involved in this project. Let's take this offline.
- Ken
----------------------------------------------------
Copyright (c) 2001 by Ken Seehof
This document may not be distributed, copied,
duplicated, or replicated, or duplicated in any
form without express permission by Ken Seehof.
Permission is hereby granted.
kseehof at neuralintegrator.com
----------------------------------------------------
The opinions expressed herein are not necessarily
those of George W. Bush.