[Spambayes] how spambayes handles image-only spams

Bill Yerazunis wsy at merl.com
Sun Sep 7 22:10:13 EDT 2003


   From: "Tim Peters" <tim.one at comcast.net>

   > so you both would, and would not get features corresponding to =value
   > stuff.

   I don't know -- it depends on how you split the bytestream into units.  When
   Ryan hypothesized he'd see "img src http", it's assuming all sorts of stuff,
   such as that you drop the "<" presumably preceding "img", drop the '="'
   presumably following "src", fold case, and so on.

The current tokenizing regex in the CRM114 distribution is:

    [[:graph:]][-.,:[:alnum:]]*[[:graph:]]?

which was handcrafted to make good tokens out of a lot of stuff, from
HTML to IP addresses.

The < doesn't get dropped, nor would the closing > on a tag, but
it _would_ break img src=foobar into
   img
   src=
   foobar

with the usual sparse window mumbo-jumbo after that.
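
A minimal sketch of how that regex splits text, written in Python for illustration. Python's re module has no POSIX bracket classes, so this assumes plain ASCII and substitutes [!-~] for [[:graph:]] and [A-Za-z0-9] for [[:alnum:]]. The name tokenize is hypothetical and is not part of CRM114 or spambayes.

```python
import re

# ASCII approximation (an assumption) of the POSIX pattern
#   [[:graph:]][-.,:[:alnum:]]*[[:graph:]]?
# [[:graph:]] -> [!-~]        any printable, non-space ASCII character
# [[:alnum:]] -> [A-Za-z0-9]
TOKEN_RE = re.compile(r"[!-~][-.,:A-Za-z0-9]*[!-~]?")

def tokenize(text):
    """Return the non-overlapping matches of TOKEN_RE, left to right."""
    return TOKEN_RE.findall(text)

if __name__ == "__main__":
    for sample in ("img src=foobar", "<img src=foobar>", "from 192.168.0.1"):
        print(sample, "->", tokenize(sample))
```

With this approximation the leading < stays attached to img and the closing > stays attached to foobar, while an IP address survives as a single token.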

   > If your spam uses a lot of tables and font colors, but your regular
   > email does not, then <table>, <td>, <font> etc will have "spam"
   > statistics but <p>, <br>, and <h1> won't.

   Try it and see what happens.  spambayes only uses unigrams, and we had
   plenty of experience with this.  CRM114 will have all those unigrams too,
   plus a giant (compared to just the unigrams) collection of "hey, there was
   HTML!" pairs, triples, etc.  The unigrams alone are enough to kill the
   classifier's usefulness for people with little (but not no) HTML ham.

See my previous message: on the SA test corpus (a nasty one, too), CRM114
found that there wasn't a lot of information in the HTML, but it still
delivered better than 98 percent accuracy.  This is on a corpus where I
myself can rarely manage even 70 percent by hand.

   > ...
   > It still uses hashing, but now with effectively a 64-bit key that _is_
   > checked.

   Don't know what it means to check a 64-bit key.  Maybe it means you're using
   64-bit hash codes now, and that a hash code leads to a chain of original
   tokens hashing to that code, and "check" means the incoming token is
   compared to the entries in the hash chain.  Or maybe it doesn't <wink>.

Yep.  The hash is cut down and used to choose a starting place in
the .css file, then the file is searched for an _exact_ match 
to that 64-bit hash.  
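
The lookup just described can be sketched roughly as follows. This is an illustration only, not CRM114's code or its .css file layout: the in-memory lists, the linear probing, the use of BLAKE2b as the 64-bit hash, and the names HashedCounts, hash64, add and get are all assumptions made for the example.

```python
import hashlib

class HashedCounts:
    """Toy table: the reduced hash picks a start slot, the full hash is checked."""

    def __init__(self, n_slots=1024):
        self.n_slots = n_slots
        self.keys = [None] * n_slots   # full 64-bit hash stored in each slot
        self.counts = [0] * n_slots    # occurrence count for that hash

    @staticmethod
    def hash64(token):
        # Assumed hash function; any well-mixed 64-bit hash would do here.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def _find(self, h):
        # "Cut down" the hash to a starting slot, then probe forward until we
        # hit an exact 64-bit match or an empty slot.
        start = h % self.n_slots
        for step in range(self.n_slots):
            i = (start + step) % self.n_slots
            if self.keys[i] is None or self.keys[i] == h:
                return i
        return None  # table full and hash not present

    def add(self, token, n=1):
        h = self.hash64(token)
        i = self._find(h)
        if i is None:
            raise RuntimeError("table full")
        self.keys[i] = h
        self.counts[i] += n

    def get(self, token):
        h = self.hash64(token)
        i = self._find(h)
        if i is None or self.keys[i] != h:
            return 0
        return self.counts[i]
```

Two tokens that land on the same starting slot are still kept apart, because the stored 64-bit value has to match exactly; only a full 64-bit clash would merge their counts.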

   >  The chance of a hash clash is something like 10^-20th,

   The chance of a hash clash on a pair of 64-bit keys is 2**-64, of course,
   but you can expect to see the first collision after hashing O(2**32) tokens.

Um, yeah.  After a couple of billion independent tokens, you might
expect a collision.

But the .css file holds far fewer entries than that, and grooming
expires old tokens, so you never really have that many "live" token
entries at any one time.

Right now, people are using 1 to 4 million tokens in their .css files.

   The Birthday Paradox lowers it back into the comfortably small billions
   <wink>.

Well, what's the Birthday Paradox limit when you only
need a birthday collision with someone in the previous 10 million or so
entries? :-)
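
A back-of-the-envelope sketch of the numbers in question, assuming an ideal, uniformly distributed 64-bit hash and using the standard birthday approximation 1 - exp(-n(n-1)/2N). The function names are hypothetical, and the 10 million figure is just the one mentioned above.

```python
import math

def p_any_collision(n, bits=64):
    """Approximate chance that at least two of n hashed tokens share a hash."""
    space = 2.0 ** bits
    # -expm1(-x) is 1 - exp(-x), computed accurately for tiny x
    return -math.expm1(-n * (n - 1) / (2.0 * space))

def p_new_token_collides(n_live, bits=64):
    """Chance that one new token's hash equals one of n_live distinct stored hashes."""
    return n_live / 2.0 ** bits

if __name__ == "__main__":
    print("any clash among 2**32 tokens  :", p_any_collision(2 ** 32))
    print("any clash among 10**7 tokens  :", p_any_collision(10 ** 7))
    print("new token vs 10**7 live tokens:", p_new_token_collides(10 ** 7))
```

Under these assumptions, a clash somewhere among 2**32 tokens is close to a coin flip, while a clash among 10 million live entries comes out at a few in a million.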

	 -Bill Yerazunis


