Graham's spam filter (was Lisp to Python translation criticism?)

Sat Aug 17 23:45:30 EDT 2002

  "John E. Barham" wrote:

>>Nice of the spammers to be giving us so much data to work with!
> 
> 
> Here's my implementation of Graham's statistical filter in Python.  It's
> based on a Corpus class (a specialized dictionary) that processes data
> (each call of the .process method should be the entire concatenated text
> of a distinct message).  One builds up two corpora [had to look that one
> up!] -- good and bad -- and then hands them to a Database instance,
> which computes the appropriate probability table.  When you want to test
> a new message, create a Corpus for it and then pass it to the database's
> .scan method, which will return the computed probability of the message
> being spam.

Every run of five successive characters in a text (a character 5-gram)
can also be used for classifying the text. This seems to be a method
the NSA uses. See:

Marc Damashek, "Gauging Similarity with n-Grams: Language-Independent
Categorization of Text", Science, 267, 843-848, 10 February 1995.

This paper is in PDF at

http://gnowledge.sourceforge.net/damashek-ngrams.pdf
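
To make the five-character idea concrete, here is a rough sketch of my
own, not code from the paper. It counts every run of five successive
characters and compares the counts by cosine similarity. The function
names, the lower-casing and whitespace folding, and the choice of plain
cosine similarity are all assumptions made for illustration; the paper
describes its own normalization.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=5):
    """Count every run of n successive characters in text."""
    # Assumption: fold case and collapse whitespace before counting.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors."""
    # Counter returns 0 for n-grams missing from b.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def classify(text, profiles, n=5):
    """Return the label whose reference profile is closest to text."""
    target = ngram_profile(text, n)
    return max(profiles,
               key=lambda label: cosine_similarity(target, profiles[label]))


if __name__ == "__main__":
    # Toy reference texts, invented for this example.
    profiles = {
        "spam": ngram_profile("buy cheap pills now, click here to buy cheap pills"),
        "good": ngram_profile("the meeting about the python parser is on tuesday"),
    }
    print(classify("click here for cheap pills", profiles))
```

In practice each reference profile would be built from a whole corpus of
good or bad messages rather than a single sentence.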

See also:

http://www.sscnet.ucla.edu/geog/gessler/167-2001/ngrams.htm
http://www.cs.umbc.edu/www/research/projects/telltale.html
http://www.cs.umbc.edu/~mayfield/ngrams.html

On Google: ngrams damashek  or just  ngrams
At http://citeseer.nj.nec.com/ search for damashek.

Starting from the Damashek search in Citeseer, it is easy to find a 
large academic literature on ngrams, statistical text classification, 
relevant hashing algorithms, etc.