[spambayes-dev] spoof detector

David Abrahams dave at boost-consulting.com
Fri Jul 6 21:20:27 CEST 2007


on Fri Jul 06 2007, skip-AT-pobox.com wrote:

>     David> Something that comes up over and over in spam is a link of the
>     David> form:
>
>     David>     <a href="http://url/of/spammers/site">
>     David>        http://url/of/some/legit/site
>     David>     </a>
>
>     David> Does SpamBayes have a token that represents that information and
>     David> an option I can set that will use it?
>
> The SpamBayes tokenizer essentially splits the message at word boundaries,
> so the two urls are considered separately.  

Yeah, I know that's the default behavior.

> Their physical and structural proximity is not noted.  Synthetic
> tokens based on hostname or IP address in the urls will be generated
> if you add x-pick_apart_urls:True to the Tokenizer section of your
> config file.  For completeness here is my current set of tokenizer
> settings (haven't changed them in a long while):
>
>     [Tokenizer]
>     record_header_absence:True
>     summarize_email_prefixes:True
>     summarize_email_suffixes:True
>     mine_received_headers:True
>     x-pick_apart_urls:True
>     x-fancy_url_recognition:False
>     x-lookup_ip:True
>     lookup_ip_cache:~/tmp/dnscache.pck
>     x-image_size:True
>     x-crack_images:True
>     x-ocr_engine:gocr
>     max_image_size:100000
>     crack_image_cache:~/tmp/imagecache.pck

That doesn't sound like it's doing what I'm asking about.  I want a
special token that is generated each time a link's text is just a URL
and the link and the URL text don't point to the same place.  Messages
with this property are always spam and account for a large percentage
of my unsures.  No matter how much I train on them, they keep falling
into unsure, so I thought if SpamBayes could actually recognize their
distinguishing feature I could easily train it to consider them spam.

From what you say above it looks like pick_apart_urls will generate
tokens describing different parts of a given URL, but will do nothing
to help capture this particular spammy relationship between enclosed
text and actual link.

Or did I misunderstand you?
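
For illustration, the kind of check described above could be sketched
roughly as follows.  This is only a sketch and not SpamBayes code: the
SpoofedLinkFinder class, the spoof_tokens function and the
"spoofed-link" token name are all hypothetical, and comparing hostnames
(rather than full URLs) is an assumption about what "the same place"
should mean.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class SpoofedLinkFinder(HTMLParser):
    """Collect anchors whose visible text is a URL on a different host."""

    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text chunks seen inside that <a>
        self.mismatches = []   # (href, visible_text) pairs that look spoofed

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # Only flag links whose text is itself a URL (assumption:
            # http/https only) and whose host differs from the href's.
            if _is_url(text) and _host(text) != _host(self._href):
                self.mismatches.append((self._href, text))
            self._href = None


def _is_url(text):
    return text.lower().startswith(("http://", "https://"))


def _host(url):
    try:
        return (urlsplit(url).hostname or "").lower()
    except ValueError:
        # Malformed URL (e.g. bad IPv6 literal); treat as having no host.
        return ""


def spoof_tokens(html):
    """Yield one hypothetical 'spoofed-link' token per mismatched anchor."""
    finder = SpoofedLinkFinder()
    finder.feed(html)
    finder.close()
    for _href, _text in finder.mismatches:
        yield "spoofed-link"
```

Feeding this the body of an HTML part would yield one token for each
link whose text names a different host than its href, and nothing for
ordinary links such as "click here".  A stricter variant could compare
the full URLs instead of just the hostnames.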

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

