[spambayes-dev] RE: Network checks

Sean R. Lynch seanl at chaosring.org
Sat Sep 20 22:23:26 EDT 2003


On Sat, 20 Sep 2003 20:13:10 -0400, Tim Peters wrote:

> [Sean R. Lynch]

> So far you sound determined to overrule your father's judgment about
> what he does and doesn't like.  In that case, spambayes may be the worst
> classifier available <wink>.

That's not the problem; perhaps I wasn't explicit about what the real
problem is. Learning filters work well for computer geeks like us because
there are large differences between what we consider spam and what we
consider ham, as you'll see if you feed *your* spam and ham through a
Kohonen network: there is a well-defined dividing line between them.
However, if your ham looks a lot like spam and vice versa, there is no
such clean dividing line, and the filter just isn't going to be as good.
So what I actually want to do is suggest to my father that something was
sent by a spammer (i.e. stick it in his quarantine folder until he starts
releasing that sort of mail from quarantine).

>> is to have a good filter to begin with.  SpamAssassin seems like it
>> would be reasonable, but if I'm gonna use SpamAssassin, why not use its
>> built-in Bayesian filter? The main reason I won't is that I really want
>> to use SpamAssassin's network checks, and IMHO it's bad netizenship to
>> run them more than once on the same message, and enough messages go to
>> multiple users on my server that I'd really like to run SA as a content
>> filter.
> 
> You could run SA as a rule-based content filter and disable its network
> checks.

I could, but the network checks are half the reason I want SA in the first
place: they're precisely the kind of evidence a learning filter has no
access to on its own right now.

>> I think that Bayesian filters really need to include their training
>> time in performance analyses, rather than just comparing their ultimate
>> performance after being trained. The "best" of the Bayesian filters
>> seem to require the longest training times, and I don't really consider
>> this to be a good thing, because "training time" really translates to
>> both false positives and false negatives (an unsure is a false negative
>> as far as I'm concerned).
> 
> It's extremely dangerous to consider an unsure to be a false negative.
> They're unsure precisely because, based on the training a classifier has
> been given, the evidence in favor of ham is approximately equal to the
> evidence in favor of spam.  It does appear to be the case that some
> people usually end up thinking an unsure is spam.  This isn't universal,
> though! For example, it varies from time to time in my classifier, and
> the past few weeks I've considered most unsures to be ham (of the ones I
> could make up my own mind about!  I throw away about half my unsures
> untrained-on, because I have no idea what they're on about -- might be
> ham, might be spam, but they're too confusing regardless to be worth the
> effort of researching). Different people get very different mixtures of
> email.

I misspoke a bit there. What I meant was that classifying a spam as unsure
amounts to a false negative.

> If you're convinced that all unsures are really spam, move your spam
> cutoff lower.  If you move it far enough so that it equals your ham
> cutoff, you'll never see another unsure again.  I recommend against it,
> but suit yourself.

Actually, I like unsures, and one idea I've been thinking about is to fall
back to the rule-based filter if the learning filter gives an unsure.
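
As a rough sketch of what I mean, using the spambayes score convention of
0.0 = ham, 1.0 = spam (the classifier objects here are hypothetical
stand-ins, not the real spambayes or SA interfaces):

    HAM_CUTOFF = 0.20
    SPAM_CUTOFF = 0.90

    def classify(msg, learner, rule_filter):
        # Hypothetical stand-ins: learner.score() returns a float in
        # [0.0, 1.0]; rule_filter.is_spam() is a rule-based verdict.
        score = learner.score(msg)
        if score <= HAM_CUTOFF:
            return "ham"
        if score >= SPAM_CUTOFF:
            return "spam"
        # Unsure: defer to the rule-based filter instead of punting.
        return "spam" if rule_filter.is_spam(msg) else "ham"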

>> If IP addresses, email addresses (in the body), domains, and URLs could
>> be shared among users of Bayesian filters, I think this would reduce
>> training time significantly, because there are large numbers of each of
>> them out there, but they have the potential to be the biggest spam
>> clues.
> 
> Sharing can be helpful for people with a shared sense of what spam is.
> For example, all the email sent to tech mailing lists via python.org
> goes thru a spambayes classifier, and tens of thousands of mailing list
> recipients benefit from sharing what their mailing lists' classifiers
> have been taught about spam.  This is appropriate, because most tech mailing
> lists have a very strong shared definition of spam (commercial messages
> of any type are considered spam, except for those highly specific to the
> mailing list topic (in which case the message necessarily contains lots
> of words hammy wrt the list topic)).
> 
> This is a very easy form of sharing, of course, because it's confined to
> one classifier.  Fancier schemes would require setting up distributed
> trust networks, etc.  In the end, I bet that subsystem would dwarf the
> current spambayes codebase.  IOW, lots of work.

I was thinking of something like the way Razor and DCC currently work with
message digests, but keyed on IP addresses, returning scores like DCC's
counts or Razor's confidence measure. Yes, this is lots of work, but we're
making smarter spammers, and we can either stay ahead of them or keep
playing catch-up. The training time already makes a learning filter hard
for my dad to use; if it has to be continually retrained, that's going to
be worse. As I said before, training time should be counted as part of the
performance metric.
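
To make the shape concrete, the client end might look something like the
following; the service, host name, port, and one-line wire format are all
invented for illustration:

    # Strawman client for a hypothetical shared IP-reputation service,
    # Razor/DCC-style but keyed on relay IPs rather than message digests.
    import socket

    def ip_confidence(ip):
        """Ask the (made-up) service for a 0-100 spam confidence score."""
        sock = socket.create_connection(("reputation.example.org", 10101),
                                        timeout=10)
        try:
            sock.sendall(("check %s\r\n" % ip).encode("ascii"))
            reply = sock.recv(64).decode("ascii")   # e.g. "85\r\n"
            return int(reply.strip())
        finally:
            sock.close()

A preprocessor could then fold that number into coarse tokens (say,
ip-rep:high / ip-rep:low) rather than issuing a hard verdict.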

>> Email addresses, domains, and URLs are harder, because IMHO they can
>> really only be used as spam clues if they're going to be shared.
> 
> I don't agree.  For example, the .biz domain appearing in a URL is a
> very strong spam clue in my classifier, and for what should be obvious
> reasons. It could be *better* if such things were shared, but they're of
> real use in an individual classifier already.

How about .com domains? Eventually .biz won't be any more useful as a clue
than .com is. If SA supports checking these against blacklists, that's
great, and it can be implemented in a preprocessor for a learning filter.
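
For instance, a preprocessor could hand the learner the domain and TLD of
every URL as tokens and let training decide how much weight .biz, .com, or
any individual domain deserves. A sketch, with arbitrary token spellings:

    import re

    # Crude host extraction; good enough to illustrate the idea.
    URL_HOST_RE = re.compile(r'https?://([^/\s:]+)', re.IGNORECASE)

    def url_tokens(body):
        tokens = []
        for host in URL_HOST_RE.findall(body):
            host = host.lower()
            tokens.append("url:host:" + host)
            tokens.append("url:tld:" + host.rsplit(".", 1)[-1])
        return tokens

    # url_tokens("see http://cheap-meds.biz/offer")
    #   -> ['url:host:cheap-meds.biz', 'url:tld:biz']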

>> but I think it might be better to use them to generate more features
>> for the Bayesian filters to use for classification...
> 
> Some clues are so strong that they're (IMO) better suited to rule-based
> systems.  Ours is a preponderance-of-evidence system, where no clue on
> its own is strong enough to drive the final decision.  But if I can
> determine "Korean character set and from an open relay", then it's
> certain to be spam for me.  spambayes isn't well suited to exploiting
> such killer-strong criteria.

That's an interesting point. However, some blacklists are better suited to
rule-based systems than others. I would not, for example, use the XBL in a
rule-based system, but I might use it in a preprocessor for a learning
filter. SBL and the Wirehub Permblock, on the other hand, seem quite
reliable, so it makes sense to use them in a rule-based classifier.
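
Concretely, the split could look like this; the DNSBL zones are real, but
the rule/token division and the function shapes are just my sketch:

    import socket

    def dnsbl_listed(ip, zone):
        # Standard DNSBL query: reverse the octets and resolve under
        # the list's zone; NXDOMAIN means "not listed".
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)
            return True
        except socket.gaierror:
            return False

    def network_clues(relay_ip):
        # SBL is reliable enough to act as a hard rule.
        if dnsbl_listed(relay_ip, "sbl.spamhaus.org"):
            return "spam", []
        # XBL is noisier: just add a token and let training weigh it.
        tokens = []
        if dnsbl_listed(relay_ip, "xbl.spamhaus.org"):
            tokens.append("dnsbl:xbl")
        return None, tokens        # no verdict, possibly extra tokens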

>> some sort of script that just adds a bunch of keywords to the headers
>> based on the result of network checks. This combined with a
>> pre-trained global database that only handles features that are missing
>> from the user's own database (ala spamprobe) would be great for a
>> commercial spam filtering engine that requires no training time to be
>> decent, and becomes very good with only a little training.
> 
> You won't know that until you test an implementation in real life.  Lots
> of people have lots of good-sounding arguments about what will and won't
> work. We did the work here of implementing and rigorously testing our
> ideas.  Most of them failed in real life, BTW.

Of course. I'm not telling you to implement this because it's guaranteed
to work; I'm saying here's my idea, and I'd like feedback on it.

I think what I really want is a good framework for combining a rule-based
classifier with relatively stable rules (i.e. leaving out a lot of SA's
body checks), a preprocessor that adds some tokens based on network
checks, and a learning classifier. For example, it struck me as somewhat
strange that SpamBayes bothers to de-obfuscate text that's broken up with
<frames><noframes>... when anything that uses such a trick is obviously
spam. Likewise, something with an obfuscated URL is almost certainly spam.
Yet SB doesn't seem to use the mere existence of obfuscated text or URLs
as a feature for classification.

Basically it's kind of like SpamAssassin, but *requiring* the learning
classifier, with the learning classifier per-user and the rule-based
classifier run once per message. Individual users shouldn't need to muck
with the settings of the rule-based classifier or the preprocessor,
because those should be quite generic.
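
Schematically, with every name here hypothetical (this shows the shape,
not an implementation):

    def filter_message(msg, stable_rules, net_preprocessor, user_learner):
        # 1. Per-message: stable, near-certain rules (e.g. obfuscated
        #    text or URLs => spam, no de-obfuscation needed).
        verdict = stable_rules.check(msg)      # "spam", "ham", or None
        if verdict is not None:
            return verdict
        # 2. Per-message: run the network checks once, folding the
        #    results into the message as extra tokens/headers so every
        #    recipient's filter can use them.
        msg = net_preprocessor.add_tokens(msg)
        # 3. Per-user: the learning classifier sees the augmented message.
        return user_learner.classify(msg)      # "spam", "ham", or "unsure"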

I could use the "fall back to rule-based when unsure" approach, but I'd
rather combine the two in a way that builds on the strengths of both
filters and reduces both false negatives and false positives. If I can
work with the raw scores of both, and get some sort of confidence measure
from the learning classifier based on how much training it's received, I
can change what I consider unsure over time (the range of unsures needs to
widen as the learning classifier gets more training), and that might
actually be a useful approach.
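
Something like the following is what I'm picturing; the trust curve and
the constants are untested guesses, not measured values:

    def combine(learn_score, rule_score, n_trained):
        """Both scores in [0.0, 1.0]; higher means spammier."""
        trust = n_trained / (n_trained + 200.0)   # grows toward 1.0
        half_width = 0.35 * trust                 # unsure band widens
        if abs(learn_score - 0.5) <= half_width:
            return "unsure"                       # quarantine / hand off
        score = trust * learn_score + (1.0 - trust) * rule_score
        return "spam" if score > 0.5 else "ham"

With no training at all this degenerates to the rule-based score alone,
which is exactly what an untrained filter should do.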

Sorry about writing yet another novel, but I've been thinking about this
for a long time :)




