[spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15

Seth Goodman sethg at GoodmanAssociates.com
Sat Apr 17 14:34:44 EDT 2004

> From: Thomas Juntunen
> Sent: Friday, April 16, 2004 8:54 PM


> http://www.qaqd.com/research/spam-e1.htm

I would like to read this article, but the link redirects to a login page
that doesn't accept 'guest', 'anonymous' or an email address as a login.
Could you provide another link or send me a copy of the article?

In his post, Skip mentioned principal component analysis as the technique
the author used.  If this is the same as the method by that name we use in
electrical engineering, this means decomposing a signal into a series of
Eigenvectors (orthogonal components), each with an associated Eigenvalue
that indicates the strength (electrical power) of that particular component.
You then throw away the components whose Eigenvalues are similar in size to
those known to be noise (completely random, no information content), leaving
what are called the principal components.  Under good conditions, the
principal components comprise _most_ of the information portion of the
signal, though it doesn't always come out that way.  This is but one of many
methods for breaking a signal down into orthogonal components and removing
noise.  The method has its pros and cons, which have a lot to do with the
nature of the signal and how much you know about it ahead of time.
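
As a rough illustration of the procedure described above, here is a minimal
sketch in Python with NumPy.  It is not taken from the article under
discussion.  The function name principal_components, the noise_power
threshold and the synthetic three-channel data are all assumptions made up
for this example, and it presumes the noise level is known ahead of time.

```python
import numpy as np

def principal_components(samples, noise_power):
    """Keep only the components whose Eigenvalue exceeds a noise floor.

    samples:     2-D array, one observation per row, one channel per column.
    noise_power: hypothetical threshold; components at or below it are
                 treated as noise and discarded.
    """
    # np.cov subtracts the per-channel mean before forming the covariance.
    cov = np.cov(samples, rowvar=False)
    # eigh handles symmetric matrices and returns Eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]          # strongest component first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    keep = eigenvalues > noise_power
    return eigenvalues[keep], eigenvectors[:, keep]

# Synthetic data (an assumption): one underlying signal spread over two
# channels, a third channel with no signal, and weak noise on all three.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
data = np.column_stack([signal, 2 * signal, np.zeros(500)])
data = data + 0.1 * rng.normal(size=(500, 3))

values, vectors = principal_components(data, noise_power=0.05)
print(len(values), "component(s) kept")
```

With this data the two noise-only directions have power close to 0.01, so
only the single signal direction survives the threshold.  Choosing that
threshold is exactly the part that depends on how much you know about the
signal ahead of time.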

I can think of several issues with applying PC analysis to a text message
instead of a signal stream.  Since a text message can be parsed in different
ways to create a signal to do the Eigendecomposition on, results will depend
on whether you treat it as a bit stream, a character stream (with what
character length?) or a token stream (tokenized how?).  It would also be
possible to treat the SpamAssassin results as tokens and use only those to
create a token stream.
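
To make the parsing question concrete, here is a small Python sketch that
turns one message into the three kinds of stream mentioned above.  The
sample message, the 8-bit character assumption and the whitespace tokenizer
are all choices made up for illustration.  They are not what SpamBayes or
the article actually does.

```python
import re

message = "Buy cheap meds now"   # made-up sample text

# Bit stream: assumes 8-bit ASCII characters, most significant bit first.
bit_stream = [int(bit)
              for byte in message.encode("ascii")
              for bit in format(byte, "08b")]

# Character stream: one integer code point per character.
char_stream = [ord(ch) for ch in message]

# Token stream: lowercase, split on whitespace (one of many possible tokenizers).
token_stream = re.findall(r"\S+", message.lower())

print(len(bit_stream), len(char_stream), len(token_stream))
```

The same message gives three sequences of different length and alphabet, so
any decomposition run on them starts from different data.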

I need to read the article, but applying Eigendecomposition to a text
message raises a lot of questions for me.


Seth Goodman
