[Spambayes] how spambayes handles image-only spams
Bill Yerazunis
wsy at merl.com
Sun Sep 7 22:10:13 EDT 2003
From: "Tim Peters" <tim.one at comcast.net>
> so you both would, and would not get features corresponding to =value
> stuff.
I don't know -- it depends on how you split the bytestream into units. When
Ryan hypothesized he'd see "img src http", it's assuming all sorts of stuff,
such as that you drop the "<" presumably preceding "img", drop the '="'
presumably following "src", fold case, and so on.
The current tokenizing regex in the CRM114 distribution is:
[[:graph:]][-.,:[:alnum:]]*[[:graph:]]?
which was handcrafted to make good tokens out of a lot of stuff, from
HTML to IP addresses.
The < doesn't get dropped, nor would the closing > on a tag, but
it _would_ break img src=foobar into
img
src=
foobar
with the usual sparse window mumbo-jumbo after that.
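For readers who want to try the split, here is a minimal sketch of that regex in Python. Python's `re` module has no POSIX bracket classes, so the sketch assumes plain ASCII and substitutes `[!-~]` for `[[:graph:]]` and `[A-Za-z0-9]` for `[[:alnum:]]`. The names `TOKEN_RE` and `tokenize` are made up for illustration and are not part of CRM114.

```python
import re

# ASCII translation of the POSIX classes (an assumption of this sketch):
#   [[:graph:]] -> [!-~]         any visible, non-space ASCII character
#   [[:alnum:]] -> [A-Za-z0-9]
TOKEN_RE = re.compile(r"[!-~][-.,:A-Za-z0-9]*[!-~]?")

def tokenize(text):
    """Return the non-overlapping, left-to-right matches of TOKEN_RE."""
    return TOKEN_RE.findall(text)

print(tokenize("img src=foobar"))    # ['img', 'src=', 'foobar']
print(tokenize("<img src=foobar>"))  # ['<img', 'src=', 'foobar>']
print(tokenize("192.168.0.1"))       # ['192.168.0.1']
```

The second call shows the "<" and ">" staying attached to their neighbouring tokens rather than being dropped, and the third shows an IP address surviving as a single token.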
> If your spam uses a lot of tables and font colors, but your regular
> email does not, then <table>, <td>, <font> etc will have "spam"
> statistics but <p>, <br>, and <h1> won't.
Try it and see what happens. spambayes only uses unigrams, and we had
plenty of experience with this. CRM114 will have all those unigrams too,
plus a giant (compared to just the unigrams) collection of "hey, there was
HTML!" pairs, triples, etc. The unigrams alone are enough to kill the
classifier's usefulness for people with little (but not no) HTML ham.
See previous message - in the SA test corpus (nasty one, too) CRM114
found that there wasn't a lot of info in the HTML... but still delivered
better than 98 per cent accuracy. This is on the corpus that I myself
can rarely do even 70% on manually.
> ...
> It still uses hashing, but now with effectively a 64-bit key that _is_
> checked.
Don't know what it means to check a 64-bit key. Maybe it means you're using
64-bit hash codes now, and that a hash code leads to a chain of original
tokens hashing to that code, and "check" means the incoming token is
compared to the entries in the hash chain. Or maybe it doesn't <wink>.
Yep. The hash is cut down and used to choose a starting place in
the .css file, then the file is searched for an _exact_ match
to that 64-bit hash.
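A rough sketch of that lookup, under loud assumptions: the real .css file layout and hash function are not described here, so this uses an in-memory list, linear probing from the starting slot, and a truncated BLAKE2b digest as a stand-in 64-bit hash. `hash64` and `HashTable` are hypothetical names for illustration only.

```python
import hashlib

def hash64(token):
    """Stand-in 64-bit hash (BLAKE2b cut to 8 bytes); not CRM114's own hash."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

class HashTable:
    """Fixed-size table that stores the full 64-bit hash in each slot."""

    def __init__(self, n_slots):
        self.keys = [None] * n_slots   # full 64-bit hash, or None if empty
        self.counts = [0] * n_slots

    def _find(self, h):
        """Return the slot holding h, else the first empty slot, else None."""
        n = len(self.keys)
        start = h % n                  # "cut down" hash picks the starting place
        for step in range(n):          # linear probing is an assumption here
            i = (start + step) % n
            if self.keys[i] is None or self.keys[i] == h:
                return i
        return None                    # table is full and h is not present

    def add(self, token):
        h = hash64(token)
        i = self._find(h)
        if i is None:
            raise RuntimeError("table full")
        self.keys[i] = h
        self.counts[i] += 1

    def count(self, token):
        h = hash64(token)
        i = self._find(h)
        # Only an exact match on the full 64-bit value counts as a hit.
        if i is None or self.keys[i] != h:
            return 0
        return self.counts[i]

table = HashTable(8)
for tok in ("img", "src=", "img"):
    table.add(tok)
print(table.count("img"), table.count("src="), table.count("foobar"))
```

Two tokens that land on the same starting slot are kept apart by the exact comparison, so a false hit needs the full 64-bit values to be equal.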
> The chance of a hash clash is something like 10^-20th,
The chance of a hash clash on a pair of 64-bit keys is 2**-64, of course,
but you can expect to see the first collision after hashing O(2**32) tokens.
Um, yeah. After a couple of billion independent tokens, you might
expect a collision.
But since the size of the .css file is far smaller, you'd have to take into
account that grooming will expire old tokens out and so you never really
get that many "live" token entries at any one time.
Right now, people are using 1 to 4 million tokens in their .css files.
The Birthday Paradox lowers it back into the comfortably small billions
<wink>.
Well, what's the Birthday Paradox limit for when you only
need a birthday collision with someone in the previous 10 million or so
entries? :-)
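A back-of-the-envelope sketch of the numbers, assuming the 64-bit hashes behave like uniform random values and that grooming keeps a fixed-size window of live entries. The function names are invented for this example.

```python
import math

KEYSPACE = 2 ** 64

def p_any_collision(n, keyspace=KEYSPACE):
    """Birthday approximation: P(at least one clash among n random keys)."""
    return -math.expm1(-n * (n - 1) / (2 * keyspace))

def p_new_key_collides(live, keyspace=KEYSPACE):
    """P(a fresh random key equals one of `live` distinct stored keys)."""
    return live / keyspace

def expected_tokens_until_collision(live, keyspace=KEYSPACE):
    """Mean count of fresh keys before one hits the live window (geometric)."""
    return keyspace / live

for n in (1_000_000, 4_000_000, 10_000_000, 2 ** 32):
    print(f"{n:>13,d} keys: P(any collision) ~ {p_any_collision(n):.3g}")

window = 10_000_000
print(f"per-token clash chance vs {window:,d} live entries: "
      f"{p_new_key_collides(window):.3g}")
print(f"expected new tokens before a clash: "
      f"{expected_tokens_until_collision(window):.3g}")
```

With a window of 10 million live entries the per-token chance is 10^7 / 2^64, so on these assumptions a clash is expected only after on the order of 10^12 new tokens, well beyond the 2**32 figure for a table that never expires anything.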
-Bill Yerazunis