Graham's spam filter (was Lisp to Python translation criticism?)
Edward C. Jones
edcjones at erols.com
Sat Aug 17 23:45:30 EDT 2002
"John E. Barham" wrote:
>>Nice of the spammers to be giving us so much data to work with!
>
>
> Here's my implementation of Graham's statistical filter in Python. It's
> based on a Corpus class (a specialized dictionary) that processes data
> (each call of the .process method should be the entire concatenated text
> of a distinct message). One builds up two corpora [had to look that one
> up!] -- good and bad -- and then hands them to a Database instance,
> which computes the appropriate probability table. When you want to test
> a new message, create a Corpus for it and then pass it to the database's
> .scan method, which will return the computed probability of the message
> being spam.
All groups of five successive characters in a text (character 5-grams)
can also be used for classifying the text. This seems to be a method the
NSA uses. See:
Marc Damashek, "Gauging Similarity with n-Grams: Language-Independent
Categorization of Text", Science, 267, 843-848, 10 February 1995.
This paper is in PDF at
http://gnowledge.sourceforge.net/damashek-ngrams.pdf
See also:
http://www.sscnet.ucla.edu/geog/gessler/167-2001/ngrams.htm
http://www.cs.umbc.edu/www/research/projects/telltale.html
http://www.cs.umbc.edu/~mayfield/ngrams.html
On Google: ngrams damashek or just ngrams
At http://citeseer.nj.nec.com/ search for damashek.
Starting from the Damashek search in Citeseer, it is easy to find a
large academic literature on ngrams, statistical text classification,
relevant hashing algorithms, etc.
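
As a rough illustration of the five-character idea, here is a minimal
sketch in present-day Python. The names char_ngrams and cosine_similarity
are made up for this example, and comparing count vectors by cosine
similarity is an assumption about one simple way to use the counts; it is
not a reproduction of the exact procedure in Damashek's paper.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=5):
    """Count every run of n successive characters in text.

    Hypothetical helper: the window slides one character at a time,
    so "abcdef" with n=5 yields "abcde" and "bcdef".
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors.

    Returns 0.0 when either vector is empty (text shorter than n).
    """
    # Counter returns 0 for missing keys, so unshared n-grams add nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

To classify a new message one could, for instance, build one count
vector from the known-good mail and one from the known-bad mail, then
see which of the two the message's own vector is closer to.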