[Spambayes] Re: There Can Be Only One

Wed, 25 Sep 2002 02:03:56 -0400

[Harald Koch]
> I don't know about you, but I receive about 250 spam a *month*.

Now you know about me too <wink>:  I get more than 100 a day, not counting
my hotmail account.

> It would take me 8 months to collect 2000 spams. However, the spam
> I was receiving eight months ago is generally quite different from
> the spam I'm receiving now, so even if I *did* have an archive going
> back that far it wouldn't be terribly useful.

I'm not sure what "quite different" means.  I've been getting porn spam, and
get-rich-quick spam, and human-growth-hormone spam (etc), for a loooooong
time.  Ooh!  A new one just came while I was typing:  "Dear Valued Friend,
Great news! Your new 50% Hotel Discount Card is ready to be downloaded!
[yadda yadda yadda]" ... there's nothing unique about that.

As to whether an old archive would or wouldn't be useful, that's something
that could be tested (and should be).
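
One way to run that test, as a rough sketch only: split the spam archive at a cutoff date, train on the old part, and measure the false negative rate on the new part. The function name, the (timestamp, text) tuples, and the toy "seen this word before" classifier are all made up for illustration; a real run would plug in the actual classifier and train it on ham as well.

```python
def time_split_fn_rate(spams, cutoff, train, is_spam):
    """Train on spam older than cutoff, return the fn rate on the rest.

    spams is a list of (timestamp, text) pairs (hypothetical layout).
    """
    old = [text for ts, text in spams if ts < cutoff]
    new = [text for ts, text in spams if ts >= cutoff]
    for text in old:
        train(text)
    if not new:
        return 0.0
    missed = sum(1 for text in new if not is_spam(text))
    return missed / len(new)

# Toy stand-in classifier: a message is "spam" if it shares any word
# with previously trained spam.  Purely illustrative.
seen = set()

def train(text):
    seen.update(text.lower().split())

def is_spam(text):
    return any(word in seen for word in text.lower().split())

spams = [
    (1, "cheap pills"),
    (2, "cheap loans"),
    (9, "cheap watches"),
    (10, "hotel discount card"),
]
rate = time_split_fn_rate(spams, cutoff=5, train=train, is_spam=is_spam)
print(rate)
```

A high rate on the newer half would support the "old archives go stale" theory; a low one would argue against it.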

Another one just came in:

    Subject: Change your life!!! Last time you spent $25 did u make
        $100,000,s? Well this time you will. pvh

    Hello

    You may have seen this business before and
    ignored it. I know I did - many times! However,
    please take a few moments to read this letter.
    I was amazed when the profit potential of this
    business finally sunk in... and it works!

    [yadda yadda yadda]

I've been seeing that one for years too (it's the usual chain letter).

Are you using an email account that already has some sort of spam filtering
in place?  If so, that would explain why only unusual spam gets through to
you.  I do nothing to try to stop spam.

Another one just came in:  the usual Viagra pitch.

> One of the purported strengths of Paul's original idea was that it was
> *adaptive*.

Indeed yes!  We didn't take that part out, and I've run experiments starting
by training on a single ham and a single spam.  With just that much, the fp
rate was about 33% and the fn rate about 20%, predicting against msgs at
random from my test data.  Add another ham and spam to the training, and it
does better.  Etc -- it certainly learns.  Something I suspect but haven't
had time to test:  there's a critical mass of spam messages such that once
you've trained on that many, you've basically seen them all.  4,000 isn't
enough.  10,000 may be.
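
To make the "train on one ham and one spam, then add more" idea concrete, here is a toy sketch. It is not the project's classifier: the class name, the add-one smoothing, and the summed log-odds score are assumptions chosen to keep the example short. The point is only that each train() call updates the counts, so predictions shift as messages are added.

```python
import math
from collections import Counter

class TinyClassifier:
    """Toy word-count classifier for illustration only."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.nmsgs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.nmsgs[label] += 1
        self.counts[label].update(set(text.lower().split()))

    def score(self, text):
        # Sum of per-token log-odds with add-one smoothing.
        # Positive means spammy, negative means hammy.
        total = 0.0
        for tok in set(text.lower().split()):
            ps = (self.counts["spam"][tok] + 1) / (self.nmsgs["spam"] + 2)
            ph = (self.counts["ham"][tok] + 1) / (self.nmsgs["ham"] + 2)
            total += math.log(ps / ph)
        return total

clf = TinyClassifier()
clf.train("meeting agenda for tuesday", "ham")
clf.train("cheap viagra click here", "spam")
print(clf.score("cheap viagra today") > 0)
print(clf.score("tuesday meeting notes") < 0)
```

With a single message of each kind, any token not seen in training contributes nothing to the score, which is roughly why the error rates start out so high and improve as more messages are fed in.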

> Based on my recent reading of this list, I believe two important facts
> about spam in the wild are being downplayed:
>
> - mail headers (I understand *why* you're doing it, but the
>   discriminators on *my* spam vs. ham *do* often come from the headers.
>   Spammers seem to all find the same open relays at the same time ;-)

Other people on this list are mining the headers.  I can't, until I get a
single-source corpus.  This has also been a strength for us:  I started
without looking at *any* headers (they were chopped off the msg first
thing), only at content, and both error rates from the original scheme got
chopped by at least a factor of 10 each since then.  It's hard to imagine
that adding rich clues from the headers is going to make that worse,
although only time and testing will tell.

> - *time*. my spam is self-similar over short periods of time, and (with
>   some exceptions) changes and evolves over longer periods. Token
>   statistics collected eight months ago wouldn't eliminate much of my
>   current spam. Heck, if i train the classifier on only the old spam and
>   then run it against all of the new spam, the f-n rate is abysmal.

It would be interesting to try the same experiment with our classifier; I
expect our content tokenization is more powerful, simply because ours
started without the headers at all.  For example, the false negative rate was
cut in half just by special parsing and tagging of embedded URLs; e.g.,
"remove" in a message body isn't particularly incriminating, but "remove"
embedded in a URL specifically is plain damning.
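
A minimal sketch of the URL tagging idea, assuming a "url:" prefix and a deliberately crude URL regexp (both are illustrative choices here, not a description of the project's tokenizer). Words found inside a URL get a distinct token, so the classifier can learn separate statistics for "remove" in running text and "remove" in a link.

```python
import re

# Crude pattern for http/https URLs; an assumption for this sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def tokenize(body):
    """Yield tagged tokens for URL pieces, then plain body words."""
    for url in URL_RE.findall(body):
        # Break the URL on anything that isn't a letter or digit.
        for piece in re.split(r"[^a-z0-9]+", url.lower()):
            if piece:
                yield "url:" + piece
    # Tokenize the remaining text with the URLs blanked out.
    for word in URL_RE.sub(" ", body).lower().split():
        yield word

tokens = list(tokenize("To remove yourself visit http://example.com/remove now"))
print(tokens)
```

Here "remove" and "url:remove" come out as different tokens, so one can stay near neutral while the other picks up a strong spam score.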

> I'm running a very simple perl version of the algorithm right now. The
> thing that most aggressively lowers my f-n rate is my daily inbox cull;
> I feed f-n spam into the classifier whenever I find it.

You didn't say how many total spam you've trained it on, but you only get
250 a month and this scheme isn't much older than that.  If you have
relatively few data points, then it's not surprising that feeding it more
helps a lot.

>> Q. This is a ham/spam ratio of 1.  Is that realistic?
>> A. We can't test everything at once.

> It can be, especially at certain times of the day; I get most of my
> spam at night and most of my regular email during the day
> (North America time).

I believe it!  I expect there are great clues in the headers about this that
I'm missing too.

>> Q. 1800 each of ham & spam is a very large training set.
>>    Wouldn't it be better to use less training data?
>> A. We can't test everything at once.
>
> <laughter>

My usual test runs on 20,000 ham and 14,000 spam.  An original purpose of
this project was to investigate whether this scheme would be suitable for
filtering the email traffic going through python.org.  Compared to that
purpose, the test sizes I'm using are a drop in the bucket.  Even with the
limited header analysis I'm doing, on that large test I have a total of 4
false positives and 24 false negatives.  How's the Perl script doing <wink>?
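
For scale, those counts can be turned into rates with a couple of divisions. This assumes each of the 20,000 ham and 14,000 spam is scored exactly once in the run; the function name is made up.

```python
def error_rates(n_ham, n_spam, false_pos, false_neg):
    """Return (fp rate, fn rate) as fractions of ham and spam scored."""
    return false_pos / n_ham, false_neg / n_spam

fp, fn = error_rates(n_ham=20000, n_spam=14000, false_pos=4, false_neg=24)
print(f"fp rate {fp:.3%}, fn rate {fn:.3%}")
# prints: fp rate 0.020%, fn rate 0.171%
```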

Another spam just came in:

    Subject: Look At the Quality of our cars

    Dear Sir / Madam

    We are a Singapore based company dealing in Quality Used Cars and
    looking for Dealers in countries that is Right Hand Drive.

    Take a look at the quality of our cars below, click on the Number
    Plate beside the price to see photo fo the car.

    [yadda yadda yadda]

BTW, all the spams that came in while I typed this were nailed by the
classifier -- they're simply timeless in their relentless spammishness
<wink>.