[Spambayes] Spam in Images

Wed Aug 2 03:09:33 CEST 2006

    Alan> I don't think the image size works.  I just saved about 20 of my
    Alan> most recent spam images and while the vast majority (1/2) are
    Alan> pushing a stock and most of them are pretty similar in size they
    Alan> aren't all the same.

We use a trick for sizes that tends to work pretty well.  Instead of noting
the precise size, we note the log of the size in base 2 and then throw away
the fraction.  I just implemented that for image sizes and got these results
using my current training database:

    token,nspam,nham,spam prob
    image-size:2**5,1,0,0.844827586207
    image-size:2**6,4,1,0.5
    image-size:2**7,4,1,0.5
    image-size:2**8,6,0,0.96511627907
    image-size:2**9,3,0,0.934782608696
    image-size:2**10,7,1,0.620791675168
    image-size:2**11,9,0,0.97619047619
    image-size:2**12,13,0,0.983271375465
    image-size:2**13,14,0,0.984429065744
    image-size:2**14,53,0,0.995790458372
    image-size:2**15,19,1,0.813543282782

That doesn't necessarily mean much without some testing, since I don't tend
to get a lot of ham with images.  I'll create a patch and add it to the
SpamBayes website so others can try it out.
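
For anyone who wants to play with the idea before the patch shows up, the
bucketing described above can be sketched roughly as follows.  This is only an
illustration, not the patch itself: it assumes "size" means the image's byte
count, and image_size_token is a made-up helper name.

```python
def image_size_token(size_in_bytes):
    """Return a coarse size token such as 'image-size:2**12'.

    Hypothetical helper for illustration only; it assumes the size is a
    positive integer byte count.
    """
    if size_in_bytes < 1:
        raise ValueError("size must be a positive integer")
    # For positive integers, bit_length() - 1 equals floor(log2(n)), i.e.
    # the base-2 log with the fraction thrown away.  Using integer math
    # avoids floating-point rounding near exact powers of two.
    exponent = size_in_bytes.bit_length() - 1
    return "image-size:2**%d" % exponent
```

With this sketch, any size from 4096 through 8191 maps to the same token,
image-size:2**12, so images that are merely similar in size still share a
token.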

Skip