[Spambayes] SpamBayes to Handle Embedded Images

Tue Oct 4 00:49:25 CEST 2005

FMJ,

You need to experiment with the following config options.  I do not have a
problem whatsoever with embedded images.  They usually link to a site and it
gets all of the related tokens.  Try for yourself and report back:  

[Classifier]

# Generate both unigrams (words) and bigrams (pairs of words). However,
# extending an idea originally from Gary Robinson, the message is
# 'tiled' into non-overlapping unigrams and bigrams, approximating the
# strongest outcome over all possible tilings. Note that to really test
# this option you need to retrain with it on, so that your database
# includes the bigrams - if you subsequently turn it off, these tokens
# will have no effect. This option will at least double your database
# size given the same training data, and will probably at least triple
# it. You may also wish to increase the max_discriminators (maximum
# number of extreme words) option if you enable this option, perhaps
# doubling or quadrupling it. It's not yet clear. Bigrams create many
# more hapaxes, and that seems to increase the brittleness of minimalist
# training regimes; increasing max_discriminators may help to soften
# that effect. OTOH, max_discriminators defaults to 150 in part because
# that makes it easy to prove that the chi-squared math is immune from
# numeric problems. Increase it too much, and insane results will
# eventually result (including fatal floating-point exceptions on some
# boxes). This option is experimental, and may be removed in a future
# release. We would appreciate feedback about it if you use it - email
# spambayes at python.org with your comments and results.

x-use_bigrams: True
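
To give a feel for why the database grows so much with this option, here is a minimal Python sketch of emitting unigrams plus bigrams from a word stream. It is only an illustration: the function name and the "bi:" token prefix are assumptions made for this example, and the sketch does not attempt the tiling step that picks the strongest non-overlapping combination at scoring time.

```python
def unigrams_and_bigrams(words):
    """Yield every word, plus one 'bi:' token for each adjacent pair.

    Hypothetical helper for illustration; not the SpamBayes source.
    """
    previous = None
    for word in words:
        yield word
        if previous is not None:
            # The "bi:" prefix is an assumed naming scheme for pair tokens.
            yield "bi:%s %s" % (previous, word)
        previous = word


tokens = list(unigrams_and_bigrams(["cheap", "rolex", "watches"]))
print(tokens)
```

Three words produce five tokens here, which is consistent with the comment's warning that the database will roughly double or triple.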

[Tokenizer]

# This non-default option is very effective
# at nailing Asian spam with little training and small database burden.
# It should probably be exposed via the GUI, as it's not appropriate
# for people who get "high-bit ham".  Asian spam is nailed with this
# False too, but it requires more training and a larger database, since
# a sufficient variety of "8bit%" and "skip" metatokens take longer to
# learn about than strings of question marks.
replace_nonascii_chars: True

# It's helpful for Tim <wink>.
record_header_absence: True

# Recognize 'www.python.org' or ftp.python.org as URLs instead of just
# long words.
x-fancy_url_recognition: True

# Note whether url contains non-standard port or user/password elements.
x-pick_apart_urls: True

basic_header_tokenize: True
basic_header_skip: date x-.* domainkey-signature list-.*
check_octets: True
mine_received_headers: True
summarize_email_prefixes: True
summarize_email_suffixes: True

skip_max_word_size: 50
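
The "strings of question marks" mentioned under replace_nonascii_chars can be pictured with a short Python sketch. This is an assumption about the general idea (each non-ASCII character becomes a "?" before tokenizing), with a hypothetical function name; it is not taken from the SpamBayes tokenizer itself.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def replace_nonascii(text):
    """Replace each non-ASCII character with '?' (illustrative only)."""
    return NON_ASCII.sub("?", text)


sample = replace_nonascii("buy \u4fbf\u5b9c now")
print(sample)
```

Runs of high-bit characters collapse into a handful of "?" tokens, which is why little training is needed, and also why the option is a poor fit for anyone who receives legitimate non-ASCII mail.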

[URLRetriever]

# So that SpamBayes doesn't need to retrieve the same URL over and over
# again, it stores local copies of the text at the end of the URL. This
# is the directory that will be used for those copies.
x-cache_directory: url-cache

# This is the number of days that local cached copies of the text at the
# URLs will be stored for.
x-cache_expiry_days: 31

# To try and speed things up, and to avoid following unique URLs, if
# this option is enabled, SpamBayes will convert the URL to as basic a
# form as it can. All directory information is removed and the domain is
# reduced to the two (or three for those with a country TLD) top-most
# elements. For example,
#   http://www.massey.ac.nz/~tameyer/index.html?you=me
# would become
#   http://massey.ac.nz
# and
#   http://id.example.com
# would become
#   http://example.com
# This should have two beneficial effects:
#  o It's unlikely that any information could be contained in this
#    'base' url that could identify the user (unless they have a *lot*
#    of domains).
#  o Many urls (both spam and ham) will strip down into the same 'base'
#    url. Since we have a limited form of caching, this means that a lot
#    fewer urls will have to be retrieved.
# However, this does mean that if the 'base' url is hammy and the full
# is spammy, or vice-versa, that the slurp will give back the wrong
# information. Whether or not this is the case would have to be
# determined by testing.

x-only_slurp_base: True

# If this option is enabled, when a message normally scores in the
# 'unsure' range, and has fewer tokens than the maximum looked at, and
# contains URLs, then the text at those URLs is obtained and tokenized.
# If those tokens result in the message moving to a score outside the
# 'unsure' range, then they are added to the tokens for the message.
# This should be particularly effective for messages that contain only a
# single URL and no other text.
x-slurp_urls: True

# It may be that what is hammy/spammy for you in email isn't from
# webpages. You can then set this option (to "web:", for example), and
# effectively create an independent (sub)database for tokens derived
# from parsing web pages.
#
# "x-web_prefix" is a string value that defines a prefix to be added to
# tokens generated from a slurped URL.  This would be used if you wanted
# the tokens generated from a web page to be separate from the tokens
# generated from the body of an email message.  For example, the config
# setting "x-web_prefix:web:" would generate a token "spambayes" if it
# appears in an email and "web:spambayes" if it appears in a slurped URL.

x-web_prefix:web:
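
For anyone wondering what x-only_slurp_base and x-web_prefix amount to in practice, the following Python sketch reproduces the two examples from the comments above. The function names are hypothetical, and the rule "a two-letter final label means a country TLD, so keep three labels" is a simplifying assumption for this example; the real option may decide differently.

```python
from urllib.parse import urlsplit


def base_url(url):
    """Reduce a URL to scheme plus its top-most domain labels.

    Assumes the URL has a hostname. A two-letter final label is treated
    as a country TLD (assumption), so three labels are kept instead of two.
    """
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    keep = 3 if len(labels[-1]) == 2 else 2
    return "%s://%s" % (parts.scheme, ".".join(labels[-keep:]))


def prefix_tokens(tokens, prefix="web:"):
    """Tag tokens that came from a slurped page so they stay separate."""
    return [prefix + token for token in tokens]


print(base_url("http://www.massey.ac.nz/~tameyer/index.html?you=me"))
print(base_url("http://id.example.com"))
print(prefix_tokens(["spambayes"]))
```

With the prefix in place, "spambayes" seen in a message body and "web:spambayes" seen on a retrieved page are trained and scored as two different tokens.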

Erik Brown

  _____  

From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On
Behalf Of FreeMJ at HotPop.com
Sent: Sunday, October 02, 2005 6:44 PM
To: 'Herb Martin'
Cc: spambayes at python.org
Subject: Re: [Spambayes] SpamBayes to Handle Embedded Images

Herb,

OCR is probably the only sure-fire way to nail this scourge.  As far as
being resource intensive, like most other people with always-on broadband
access now, my e-mail just trickles in a little at a time.  And many/most
PCs are powerful enough to stream video nowadays; they really shouldn't
have a problem with it being added as a feature.  It's a lot more disruptive
to manage these by hand, if you ask me.  And an OCR feature could allow
itself to be disabled, if it ended up being a performance problem for
someone.

It's gotta be done.  Now that these spammers have found an easy way to trick
these engines into digging through meaningless text, there'll be no slowing
them without OCR.  I'm getting more and more of this style of Spam.  Easy to
install/use programs like SpamBayes have to keep up with the times, or
they'll die on the vine.  Years ago, when we mostly exchanged text-based
e-mail, it wasn't an issue.  But now, nearly all of the e-mail I receive is
HTML; and lots of it has images.

I'm ONLY using SpamBayes with Outlook 2003 (at home, where I'm having all
the trouble).  I love the easy button-based re-training!  And I don't really
care for the idea of having to add, train, and administer another layer.

Other than a miraculous OCR feature showing up in SpamBayes soon, I'm out of
ideas for a simple way of managing this type of mail on my home PC.  (Very
frustrating).

Thanks,

FMJ

  _____  

From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On
Behalf Of Herb Martin
Sent: Sunday, October 02, 2005 12:43 PM
To: spambayes at python.org
Subject: Re: [Spambayes] SpamBayes to Handle Embedded Images

Back in April, Tony Meyer posted that he was receiving a lot of image-based
spam.

I too am having nothing but trouble with embedded images:

- Daily ads for fake Rolex watches

- Daily stock tips

- TONS of drugs for sale.

This style of Spam contains an image at the top, followed by a bunch of
totally unrelated text that has been copied from some kind of random
composition.  I have very large Spam & Ham folders, that I've successfully
trained SpamBayes with.  It's only these image-based adverts that sneak by
EVERY DAY. 

Mostly my SpamBayes catches ALL of these when anything gets this far...

 Something really needs to be done about this type of Spam within SpamBayes.
Are any other Spam engines able to handle this stuff, by scanning the image
for text, or something?

Sure, there are others (as well as SpamBayes if you just keep training EVERY
ONE of them) but most of the others are either commercial (i.e., cost money)
OR they run on the Server (SpamAssassin, greylistd, and other filters.)

There has been talk about filters which would explicitly do OCR or some
other type of image content detection but I don't (personally) know of any
that are working/available/effective right now.

Such would also likely be "resource (CPU) intensive".

FWIW, greylisting on the server knocks down practically all of this junk and
SpamAssassin catches the rest.

The VERY occasional item that slips through our server is caught by
SpamBayes.  (Defense in depth is our key to ZERO spam -- with practically
everything REJECTED, not bounced, at the server during SMTP connect time.)

And some of us DO WISH to get graphical email -- pictures of my grandkid(s)
frequently arrive this way.

--
Herb Martin

  _____  

From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org] On
Behalf Of FreeMJ at hotpop.com
Sent: Sunday, October 02, 2005 1:53 PM
To: spambayes at python.org
Subject: [Spambayes] SpamBayes to Handle Embedded Images
