Graham's spam filter (was Lisp to Python translation criticism?)

Edward C. Jones edcjones at
Sat Aug 17 23:45:30 EDT 2002

  "John E. Barham" wrote:

>>Nice of the spammers to be giving us so much data to work with!
> Here's my implementation of Graham's statistical filter in Python.  It's
> based on a Corpus class (a specialized dictionary) that processes data
> (each call of the .process method should be the entire concatenated text
> of a distinct message).  One builds up two corpora [had to look that one
> up!] -- good and bad -- and then hands them to a Database instance,
> which computes the appropriate probability table.  When you want to test
> a new message, create a Corpus for it and then pass it to the database's
> .scan method, which will return the computed probability of the message
> being spam.

All groups of five successive characters in a text (character 5-grams)
can also be used to classify the text. This seems to be a method the NSA
uses. See:

Marc Damashek, "Gauging Similarity with n-Grams: Language-Independent
Categorization of Text", Science, 267, 843-848, 10 February 1995.

This paper is in PDF at

See also:

On Google: ngrams damashek  or just  ngrams
At search for damashek.

Starting from the Damashek search in Citeseer, it is easy to find a 
large academic literature on ngrams, statistical text classification, 
relevant hashing algorithms, etc.
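
To make the five-character idea concrete, here is a rough sketch in Python. It is not taken from Damashek's paper or from the implementation quoted above: the function names (ngram_profile, cosine_similarity, classify) are made up for illustration, and comparing profiles by plain cosine similarity is my own simplifying assumption.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=5):
    """Count every run of n successive characters in text."""
    # Lowercase and collapse whitespace so line wrapping and case
    # do not change the profile (a choice made for this sketch).
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def classify(message, good_profile, bad_profile):
    """Label a message by whichever reference profile it is closer to."""
    profile = ngram_profile(message)
    good = cosine_similarity(profile, good_profile)
    bad = cosine_similarity(profile, bad_profile)
    return "spam" if bad > good else "good"


# Toy reference texts, invented for this example only.
good_profile = ngram_profile("The meeting about the Python list archive is on Tuesday")
bad_profile = ngram_profile("Buy cheap pills now, cheap pills at low prices")
label = classify("cheap pills for sale", good_profile, bad_profile)
```

In practice the two reference profiles would be built from the concatenated text of many good and many bad messages, much like the two corpora in the word-based filter. One attraction of character n-grams is that no tokenizer is needed, so deliberately misspelled words still share most of their 5-grams with the originals.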
