[Spambayes] Introducing myself

Robert Woodhead trebor@animeigo.com
Sun Nov 10 21:59:28 2002


[my apologies if some of the suggestions/comments below have been previously
discussed, I'm still getting up to speed on the list]

>  > I'm particularly impressed with the chi-square work, it looks very
>>  interesting (but more stats for my poor stats-challenged mind to work
>>  on;
>
>So copy and paste <wink>.

Heh, call me old-fashioned, but I actually like to know how things 
work, rather than relying on black magic.  ;^)

>  > not to mention that now I'm going to have to get around to
>>  cramming python in there with all the other languages that have
>>  accumulated over the years...).
>
>In return, you can throw twelve other languages out <0.7 wink>.

Why would I ever want to do that?  You never know when you'll need to 
be able to remember PL/C, JPL, APL, TUTOR, etc., etc., etc.  Though I 
pray I never have to remember NOVA MOBOL ("Language of Kings") ;^)

>Testing has pretty much run out of steam here, though.  My error rates are
>so low now I couldn't measure an improvement in a convincing way even if one
>were to be made, and the same is true of a few others here too.  We appear
>to be fresh out of big algorithmic wins, so are pushing on to wrestling with
>deployment issues.

Indeed.  And you also have to start worrying about the metagame: 
assuming your system goes into widespread deployment, what will the 
responses of the intelligent spammer (an oxymoron, I know) be?

>BTW, download the source code and read the comments in tokenizer.py:  the
>results of many early experiments are given there in comment blocks.

Will be doing this over the next day or so.

>Spoken like someone who worked on a rule-based system <wink>.  We have three
>categories:  Ham, Unsure, and Spam, and I haven't seen anything to make me
>believe that a finer distinction than that can be quantitatively justified
>(but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's
>what I mean by "can't measure an improvement anymore", and a finer-grained
>scheme isn't going to touch those 2 mistakes; one of them is formally ham
>because it was sent by a real person, but consists of a one-line comment
>followed by a quote of an entire Nigerian scam spam -- nothing useful is
>ever going to *call* that one ham, and it scores as spam *almost* as solidly
>as an original Nigerian spam).

Ah, but there are more considerations.  First, many people's training 
sets may not be as distinct as yours, so the results might be 
blurrier.  Second, future versions of the software might end up 
including other recognizers in the mix (for example, DNSBL, URL 
heuristics, whitelists, stamping systems, etc.), so adding a bit of 
flexibility at the start doesn't cost you anything but could end up 
saving everyone a lot of work down the road.  Since most existing 
mailreader filter schemes are relatively primitive, more than 10 
levels of discrimination isn't going to be all that useful; but only 
3 seems too few.  In a 1-9 scheme, the current 3 levels would map to 
(say) 2, 5, and 8.

It's just a syntactic difference, but it gives you precious wiggle room.
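
For concreteness, here's roughly what I mean (a hypothetical Python 
sketch; the 0.20/0.90 cutoffs are illustrative, not necessarily 
spambayes's defaults):

    def nine_level_rating(score, ham_cutoff=0.20, spam_cutoff=0.90):
        # Map a 0.0-1.0 spam score onto a 1-9 scale.  The current three
        # levels land on 2, 5, and 8, leaving room on either side for
        # finer gradations later.
        if score < ham_cutoff:
            band = score / ham_cutoff                      # within Ham
            return 1 + int(round(band * 2))                # 1, 2, or 3
        elif score < spam_cutoff:
            band = (score - ham_cutoff) / (spam_cutoff - ham_cutoff)
            return 4 + int(round(band * 2))                # 4, 5, or 6
        else:
            band = (score - spam_cutoff) / (1.0 - spam_cutoff)
            return 7 + int(round(band * 2))                # 7, 8, or 9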

>"Score" is my favorite, but isn't catching on.  I believe the word "ham" for
>"not spam" was my invention, and since that one caught on big, I'm not
>fighting to the death for any others <wink>.

Hey, why quit when you're on a roll?

>
>>  * Hashing to a 32-bit token is very fast, saves a ton of memory,
>>  and the number of collisions using the python hash (I appealed for hash
>>  functions on the hackers-l and Guido was kind enough to send me the
>>  source) is low.  About 1100 collisions out of 3.3 million unique
>>  tokens on a training set I was using.
>
>That's significantly better than you could expect from a truly random hash
>function, so is fishy.  Tossing 3.3M balls into 2**32 buckets at random
>should leave 3298733 buckets occupied on average, with an sdev of 35.58
>buckets.  Getting 1100 collisions is about 4.7 sdevs fewer than the random
>mean.

I may have gotten the number of tokens wrong.  Currently my test runs 
are using 3.3M tokens, but it may have been fewer when I was doing 
the hash tests.  Maybe 2.3-2.4M tokens at that time?  Anyway, thanks 
for the info about the relative merits of CRC32 and the Python hash; 
I'd been told CRC32 was bad, so I was really surprised when it was 
marginally better.
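
For the record, Tim's numbers are easy to reproduce.  A quick Python 
sketch of the balls-in-buckets expectation (my own back-of-envelope 
code, not from spambayes):

    import math

    def expected_collisions(n_tokens, n_buckets=2**32):
        # Expected occupied buckets when n balls land uniformly at
        # random in m buckets is m * (1 - (1 - 1/m)**n); collisions
        # are the shortfall.  log1p/expm1 keep the arithmetic stable.
        m = float(n_buckets)
        occupied = -m * math.expm1(n_tokens * math.log1p(-1.0 / m))
        return n_tokens - occupied

    # 3.3M tokens -> ~1268 expected collisions, sdev ~ sqrt(1268) ~ 35.6,
    # matching Tim's figures; 2.35M tokens -> only ~643 expected.
    print(expected_collisions(3300000))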

>Since we're sticking to unigrams, we don't have an insane database burden.
>We also (by default) limit ourselves to looking at no more than 150 words
>per msg.  So I'm not sure saving some bytes of string storage is "worth it"
>for us, and it's very nice that we can get back the exact list of words that
>went into computing a score later.  A pile of hash codes wouldn't give the
>same loving effect <wink>.

Well, unless I'm missing something, you've got to keep track of every 
token you've ever seen, and you've got to look up every token you 
encounter to determine whether it's significant enough to consider in 
the final calc.  Given that, and assuming the final calc isn't 
exponential, reducing the lookup time and resources can be a big 
performance win.

Note that since you have the text of the token before you hash it, 
you can keep that around for significant tokens and display it later. 
The only reason to hash is for speed of access to the probability 
data.  The cost of the hashing is the inevitable collisions, which 
blur the probabilities for colliding tokens.
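
Concretely, I'm imagining something along these lines (a hypothetical 
sketch, not how spambayes actually stores things):

    import zlib

    class HashedCounts:
        # Probability data keyed by a 32-bit hash; readable text kept
        # only for the tokens we might want to display as clues.
        def __init__(self):
            self.counts = {}   # hash -> (spam_count, ham_count)
            self.text = {}     # hash -> token text, kept sparingly

        def key(self, token):
            return zlib.crc32(token.encode("utf-8")) & 0xffffffff

        def train(self, token, is_spam):
            k = self.key(token)
            spam, ham = self.counts.get(k, (0, 0))
            self.counts[k] = (spam + (1 if is_spam else 0),
                              ham + (0 if is_spam else 1))

        def remember(self, token):
            # Call only for significant tokens; everything else stays
            # an anonymous 32-bit key.
            self.text[self.key(token)] = token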

>Except I didn't get good enough results from his approach to justify
>pursuing it here, even leaving the hash codes at the full 32 bits.  When I
>went on to squash them to fit in a million buckets, a few false positives
>popped up that were just too bad to bear (two can be found in the list
>archives):  ham that was so obviously ham that no system that called them
>spam would be acceptable to most people.

I wasn't commenting on the phrase system, or even on hashing, but 
rather on data reduction to shrink the memory footprint of the 
statistical tables (i.e., using 1-byte frequency counts vs. 4-byte 
ones).
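
For example (an illustrative sketch; the bucket count and the cap are 
made up):

    from array import array

    N_BUCKETS = 2 ** 20                    # hypothetical million-bucket table
    counts = array('B', [0]) * N_BUCKETS   # 1 byte per bucket: ~1MB, not ~4MB

    def bump(bucket):
        # Saturate at 255 rather than wrap around; a capped count
        # barely moves a spamprob, but a wrapped one would wreck it.
        if counts[bucket] < 255:
            counts[bucket] += 1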

Also, a cautionary note: just because the current system doesn't 
generate any horrible false positives on your corpora doesn't mean it 
won't do so on Joe Schmoe's.  Or on my slightly smelly ham.

>  > * I was playing a week or two back with 1 and 2 token groups, and
>>  found that a useful technique was, for each new token, to only
>>  consider the most deviant result.  So if the individual word was .99
>>  spam, and the two word phrase was .95, it would only consider the .99
>>  result.  This would probably help with Bill Y's combinatorial scheme.
>
>It could be a viable approach to the problem mentioned above:  a scheme to
>suck out more than one word that doesn't systematically generate mounds of
>nearly redundant (highly correlated) clues.  We're clearly missing info by
>never looking at bigrams (or beyond) now, and that continues to bother me
>(even if it doesn't seem to be bothering the error rates <wink>).

Right; and, related to the metagame, you've got to consider responses 
by the spammers.  The initial attempts to defeat this kind of 
recognizer are going to try to exploit cancellation disease, probably 
by pairing a spammy preamble with a very hammy postscript.

So one possible approach would be to gradually degrade the 
significance of a token the further along in the email it appears 
(both during training and recognition).  But of course, then you'll 
have to watch for HTML email that loads the front of the message with 
invisible ham.  So a parser that spits out only the tokens a human 
would actually see is indicated.
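
The decay might look something like this (a sketch; the half-life 
parameter is pulled out of thin air):

    def positional_weight(index, total, half_life=0.5):
        # Weight for the token at position `index` of `total`, halving
        # every `half_life` fraction of the message, so a hammy
        # postscript can't cancel a spammy preamble.
        frac = index / float(max(total - 1, 1))
        return 0.5 ** (frac / half_life)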

>  > * My personal bias (as I think Guido mentioned) is for a multifaceted
>>  approach, using Bayesian, rules-based (attacking things that bayesian
>>  isn't good at, like looking for obfuscated url structures), DNSBL,
>>  and whitelisting heuristics to generate an overall ranking.  So a
>>  hammy mail from a guy in your address book would bubble up to highest
>>  priority, whereas something spammy from him would stay neutral.
>
>I'm not sure we really need it.  For example, *lots* of spam has been
>discussed on this mailing list, so much so that the python.org email admin
>had to castrate SpamAssassin for msgs to this list address else it kept
>blocking ordinary list traffic.  My personal email classifier never calls
>anything here spam, though, nor does it call the originals of the spams
>posted here ham.

Beware the One True Path.  There is strength in diversity.

Or, as the noted philosopher D. Vader put it, "Don't be too proud of 
this technological terror you have created."  As you will recall, 
those rebel scum managed to craft a nasty false positive.

>
>I do worry a little about obfuscated HTML.  We strip almost all HTML tags
>by default for a reason I've harped on enough <wink>:  all HTML decorations
>have very high spamprobs, and counting more than one of them as "a clue"
>fools almost every combining scheme into believing the msg containing them
>is spam (if you know a msg contains both <br> and <p>, it's not really more
>likely to be spam than if you just know it contains <br>!).  So we blind the
>classifier to HTML decorations now.
>
>But a spam I forwarded here a week or so ago exploited that:  the spam was
>interleaved with size=1 white-on-white news stories and tech mailing list
>postings.  The classifier *did* see those, but didn't see the HTML
>decorations hiding them.  This was a cancellation-disease-by-construction
>kind of msg, and chi-combining scored it near 0.5 as a result (solidly
>Unsure).  It's the only spam of that kind I've seen so far; if it becomes a
>popular technique, we'll have to take more HTML blinders off the classifier.

That's a classic example of metagaming.  It seems to me that the 
strength of the spambayes recognizer is in recognizing the semantics 
(the spammy meaning of the message), not the syntactics.  So train it 
only on what a human would see reading the message.  Then have 
another recognizer (rules-based, bayesian, whatever works) that deals 
with the syntactics and picks up on the HTML decoration tricks.  In 
other words, one looks at what the message says, and the other at how 
it is presented.  This will prevent that particular kind of simple 
cancellation attack.
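
Schematically (hypothetical; taking the max is just one possible 
combining rule):

    def combined_verdict(semantic_score, presentation_score):
        # Two independent recognizers: one trained only on the tokens
        # a human would actually see, one watching for HTML trickery.
        # Believe whichever is more alarmed, so hiding spammy words
        # behind hammy decoration (or vice versa) can't cancel out.
        return max(semantic_score, presentation_score)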

And that wraps back to the "more responses" suggestion above.  How do 
you rate a hammy message with spammy HTML ornaments?  Might not "a 
little hammy" be a better response than "beats me, boss!"?

>
>>  There's lots of room for cooperation between the various approaches
>>  and multiple agents means its less likely that a spam will get by.
>>  In particular, whitelisting heuristics can almost eliminate false
>>  positives.
>
>I'll let you know if I ever see one <wink>.

You will.  And it will be the one email that you really, really 
needed to read.  Murphy's Law guarantees that it will happen.  In 
fact, it typically happens (in my painful personal experience) soon 
after you make comments like the above.

>Getting vast quantities of spam isn't a problem anymore, but getting vast
>quantities of ham is.  Since your spammy ham is presumably business-related,
>I assume you can't share it.  Or can you?

Probably not.  Unless I could process them and just give you the 
tokens and frequencies in some usable format.  I'll see what I can do 
next week; gotta get Python up and running on my Mac.  Also gotta get 
the battlebot finished, or my kids will hurt me.
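
If the sharing does pan out, a dump of hashed tokens and counts might 
sidestep the confidentiality problem (a sketch; the format is 
invented):

    import zlib

    def dump_frequencies(tokens, out):
        # Write "hash<TAB>count" lines so the ham itself never leaves
        # the building, only anonymous token statistics.
        freq = {}
        for t in tokens:
            k = zlib.crc32(t.encode("utf-8")) & 0xffffffff
            freq[k] = freq.get(k, 0) + 1
        for k, n in sorted(freq.items()):
            out.write("%08x\t%d\n" % (k, n))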

>   Mixing spam and ham from
>different sources also causes worlds of problems (indeed, we still (by
>default) ignore most of the header lines partly for that reason, else the
>system gets great results for bogus reasons).

I do the same; I'm currently just looking at the subject line.

At 12:09 PM +0100 11/10/02, Rob Hooft wrote:
>I think our very good experience with the bayesian classifier would 
>"forbid" using whitelisting. Once a whitelisted feature "leaks" 
>into the spam community, it will be useless.

Not if the whitelist heuristics are based on the individual user's 
environment, as opposed to global features.

>But there is a bayesian solution to it: Make the tokenizer recognize 
>the feature that you want to whitelist or blacklist, and emit a new 
>token to that effect.
>
>    From:<in-address-book>  --> Will have a low spamprob
>    url:numeric-host        --> Will have a high spamprob
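
That is, something like this (a sketch; address_book and the regexes 
are my assumptions):

    import re

    def synthetic_tokens(headers, body, address_book):
        # Emit synthetic whitelist/blacklist tokens alongside the
        # ordinary words, so they pick up spamprobs through training
        # like anything else.  `headers` is assumed dict-like.
        sender = headers.get("From", "")
        if any(addr in sender for addr in address_book):
            yield "From:<in-address-book>"
        for host in re.findall(r"https?://([^/\s>]+)", body):
            if re.match(r"[\d.]+$", host):   # e.g. http://192.0.2.1/
                yield "url:numeric-host"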


While this is a useful approach, there is (IMHO) a need for users to 
be able to override, or at least modulate, the bayesian results in 
certain circumstances.  The classic example would be your boss 
forwarding a 419 scam to you with the comment "Looks good, I'm going 
to invest in this, what do you think?"  The spamminess might 
overwhelm the low spamprob of From:<in-address-book>.

A (paranoid) user needs to be able to tell the system "I don't care 
how spammy an email looks, if it's got this feature, I've got to at 
least glance at it with the Mk.1 Eyeball Recognition System".  Note 
that this doesn't mean it should be declared "clean as the driven 
snow", just "might not be a pile of decomposing lunchmeat".
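
In code, the override might amount to nothing more than a cap 
(illustrative threshold):

    def apply_must_see(score, tokens, must_see_features):
        # If any feature the user insists on seeing is present, cap
        # the score at the Unsure ceiling: not declared clean, just
        # never silently discarded.
        UNSURE_CEILING = 0.5               # hypothetical cutoff
        if any(t in must_see_features for t in tokens):
            return min(score, UNSURE_CEILING)
        return score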

Yeah, this means that every spam going into Microsoft will eventually 
be from "billg@microsoft.com", but the consequences of this might be 
interesting.  Or at least, amusing.

best,

R

-- 

Woodhead's Law: "The further you are from your server,  the more likely
it is to crash."


