[Spambayes] Re: There Can Be Only One

Tim Peters tim.one@comcast.net
Thu, 26 Sep 2002 12:28:35 -0400


[Greg Ward, on the "most embarrassing" fn in my last full-test run]

> Just for grins, I ran this one through SpamAssassin 2.41 (the latest,
> but not yet on mail.python.org).  SA had no trouble calling it spam:
> ...
> That give you any ideas for tokenization hacks?

This is an excellent idea.  Note that many of the clues came from the
headers, and by default most of our header tokenization is turned off.  I
hope other people jump on the things where we could be mining more clues --
I'm spread too thin.

> SPAM: ---- Start SpamAssassin results
> SPAM: 20.30 hits, 5 required;
> SPAM: *  2.2 -- From: has a malformed address

We have no code to catch that.

> SPAM: *  1.7 -- Message-Id has no @ sign

Ditto.

> SPAM: *  1.6 -- Invalid Date: header (not RFC 2822)

Ditto.

> SPAM: *  1.2 -- Message-Id is not valid, according to RFC 2822

Ditto.
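
A minimal sketch of how the Message-Id and Date checks above could be turned
into tokens, in the "produce words and let the system learn" style.  The
function name and the token spellings are hypothetical, not anything in
tokenizer.py, and "unparseable" here only means that the standard library's
email.utils.parsedate_tz gives up, which is looser than strict RFC 2822
validation.

```python
from email import message_from_string
from email.utils import parsedate_tz


def header_sanity_tokens(msg):
    """Yield hypothetical tokens describing suspicious Message-Id/Date headers."""
    msgid = msg.get("Message-Id")
    if msgid is None:
        yield "message-id:missing"
    elif "@" not in msgid:
        yield "message-id:no-at"

    date = msg.get("Date")
    if date is None:
        yield "date:missing"
    elif parsedate_tz(date) is None:
        # parsedate_tz returns None when it cannot make sense of the value.
        yield "date:unparseable"


# Usage: a made-up message with a bad Message-Id and a bad Date.
sample = message_from_string(
    "Message-Id: <no-at-sign-here>\n"
    "Date: sometime last week\n"
    "\n"
    "body\n"
)
print(list(header_sanity_tokens(sample)))
```

The tokens carry no judgment of their own; whether "message-id:no-at" turns
out spammish is left to training.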

> SPAM: *  1.1 -- Header with all capitals found

We caught several of those in this msg, via Anthony's case-sensitive
header-counting code, restricted (by default) to the list of header fields
in option safe_headers.  Note that we don't say they're ham or spam:  as
with all things, we merely produce "words", and the system learns on its own
which tokens are hammish, which spammish, and which neutral.
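
As a rough illustration of case-sensitive header counting, assuming a
made-up safe-header list and a made-up token format (neither is taken from
the actual option or code):

```python
from collections import Counter
from email import message_from_string

# Hypothetical stand-in for the safe_headers option; the real list may differ.
SAFE_HEADERS = {"to", "from", "subject", "date", "message-id"}


def header_count_tokens(msg, safe_headers=SAFE_HEADERS):
    """Yield one token per distinct header spelling, with its count.

    Membership in safe_headers is tested case-insensitively, but the
    original spelling is kept in the token, so "TO" and "To" are
    different clues.
    """
    counts = Counter(
        name for name in msg.keys() if name.lower() in safe_headers
    )
    for name, n in sorted(counts.items()):
        yield "header:%s:%d" % (name, n)


sample = message_from_string(
    "TO: someone@example.com\n"
    "Subject: hello\n"
    "X-Mailer: whatever\n"
    "\n"
    "body\n"
)
print(list(header_count_tokens(sample)))
```

An all-caps "TO" then shows up as its own token, and training decides what
it means.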

> SPAM: *  4.3 -- BODY: Claims you can be removed from the list

We caught clues related to this.

> SPAM: *  1.0 -- BODY: List removal information

Ditto.

> SPAM: * -0.1 -- BODY: Spam phrases score is 08 to 13 (medium)
> SPAM:           [score: 10]

This scheme doesn't look for phrases, although it could be taught to.  Note
that I've gotten *worse* results via using word bigrams (closer to
"phrases"), although a comment in tokenizer.py details a scheme that, at
least at the time, gave a major improvement in f-n rate via producing
bigrams and unigrams, and doing a tiny bit of lemmatization (just stripping
a trailing 's'); that also used runs of alphanumerics instead of splitting
on whitespace.  It hurt the f-p rate, though.  The tradeoffs here may be
different under Gary's scheme, so "every time you change anything all
previous decisions should be revisited" applies, as always.
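
A small sketch of the kind of scheme described above: runs of alphanumerics,
a trailing 's' stripped, then unigrams plus word bigrams.  This is a
reconstruction from that description only, not the code in the tokenizer.py
comment; the lowercasing and the function name are assumptions.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")


def unigrams_and_bigrams(text):
    """Yield unigrams, then adjacent-word bigrams, from alphanumeric runs."""
    words = []
    for w in WORD_RE.findall(text.lower()):
        # Crude lemmatization: strip one trailing 's', but keep a bare "s".
        if len(w) > 1 and w.endswith("s"):
            w = w[:-1]
        words.append(w)

    for w in words:
        yield w
    for a, b in zip(words, words[1:]):
        yield a + " " + b


print(list(unigrams_and_bigrams("Remove me from lists!")))
```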

> SPAM: *  1.0 -- URI: Includes a link to a likely spammer email address

We do a lot of special tokenization of URIs already, and I expect the system
has already learned more about spammish things in URIs than people will ever
figure out by thinking about it.

> SPAM: *  2.7 -- Date: is 24 to 48 hours before Received: date

We have no handle on that.
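
One way such a handle might look, as a sketch only: compare the Date: header
with the timestamp on the topmost Received: line and emit a coarse bucket
token.  The bucket boundaries, the token names, and the choice of the
topmost Received: line are all assumptions made for illustration.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def date_skew_token(msg):
    """Return a hypothetical token bucketing Received-minus-Date, or None."""
    date = parsedate_tz(msg.get("Date", ""))
    received = msg.get_all("Received") or []
    if date is None or not received:
        return None
    # The timestamp of a Received: header follows the last semicolon.
    stamp = parsedate_tz(received[0].rpartition(";")[2])
    if stamp is None:
        return None

    hours = (mktime_tz(stamp) - mktime_tz(date)) / 3600.0
    if hours < 0:
        return "date-skew:negative"
    if hours < 24:
        return "date-skew:<24h"
    if hours < 48:
        return "date-skew:24-48h"
    return "date-skew:>48h"


sample = message_from_string(
    "Date: Tue, 24 Sep 2002 10:00:00 -0400\n"
    "Received: from a.example by b.example; Wed, 25 Sep 2002 16:00:00 -0400\n"
    "\n"
    "body\n"
)
print(date_skew_token(sample))
```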

> SPAM: *  1.4 -- Missing To: header

We currently infer hammishness by the presence of a To: header, but don't
generate a token indicating the absence of a To: header.  Jeremy has
suggested before that we do both; I thought he had checked in an option to
do so, but I can't seem to find it.

> SPAM: *  2.2 -- RBL: Received via a relay in orbs.dorkslayers.com
> SPAM:           [RBL check: found 4.90.52.217.orbs.dorkslayers.com.]

Neil's mine_received_headers option (off by default) tokenizes IPs and their
prefixes, and will learn about networks associated with spam on its own.
However, the algorithm isn't going to learn *swiftly* when a new source
appears.  Given the error rates I'm seeing without looking at received
headers, though, I'm not sure I care <wink>.
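
To make "IPs and their prefixes" concrete, here is a minimal sketch that is
not the mine_received_headers code itself.  The regular expression, the
function name, and the "received:" token prefix are assumptions.

```python
import re

# Dotted-quad pattern; it does not check that each octet is <= 255.
IP_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def received_ip_tokens(received_value):
    """Yield a token for each IP in a Received: value and for each prefix."""
    for match in IP_RE.finditer(received_value):
        octets = match.groups()
        for n in range(1, 5):
            yield "received:" + ".".join(octets[:n])


print(list(received_ip_tokens("from foo ([217.52.90.4]) by bar.example")))
```

The shorter prefixes are what would let the classifier pick up on a whole
network rather than a single address.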

> SPAM:
> SPAM: ---- End of SpamAssassin results

That's where we win in the end:  it's far from the end of the stuff this
scheme looks at, and this scheme learns about stuff nobody would ever dream
of teaching it.