[spambayes-dev] spoof detector
David Abrahams
dave at boost-consulting.com
Fri Jul 6 21:20:27 CEST 2007
on Fri Jul 06 2007, skip-AT-pobox.com wrote:
> David> Something that comes up over and over in spam is a link of the
> David> form:
>
> David> <a href="http://url/of/spammers/site">
> David> http://url/of/some/legit/site
> David> </a>
>
> David> Does SpamBayes have a token that represents that information and
> David> an option I can set that will use it?
>
> The SpamBayes tokenizer essentially splits the message at word boundaries,
> so the two urls are considered separately.
Yeah, I know that's the default behavior.
> Their physical and structural proximity is not noted. Synthetic
> tokens based on hostname or IP address in the urls will be generated
> if you add x-pick_apart_urls:True to the Tokenizer section of your
> config file. For completeness here is my current set of tokenizer
> settings (haven't changed them in a long while):
>
> [Tokenizer]
> record_header_absence:True
> summarize_email_prefixes:True
> summarize_email_suffixes:True
> mine_received_headers:True
> x-pick_apart_urls:True
> x-fancy_url_recognition:False
> x-lookup_ip:True
> lookup_ip_cache:~/tmp/dnscache.pck
> x-image_size:True
> x-crack_images:True
> x-ocr_engine:gocr
> max_image_size:100000
> crack_image_cache:~/tmp/imagecache.pck
That doesn't sound like it's doing what I'm asking about. I want a
special token that is generated each time a link's text is just a URL
and the link and the URL text don't point to the same place. Messages
with this property are always spam and account for a large percentage
of my unsures. No matter how much I train on them, they keep falling
into unsure, so I thought if SpamBayes could actually recognize their
distinguishing feature I could easily train it to consider them spam.
From what you say above it looks like pick_apart_urls will generate
tokens describing different parts of a given URL, but will do nothing
to help capture this particular spammy relationship between enclosed
text and actual link.
Or did I misunderstand you?
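
To make the request concrete, here is a rough sketch of the kind of
check I have in mind. It is not SpamBayes code and does not use the
SpamBayes tokenizer API; it relies only on the Python 3 standard
library. The token name "url:spoofed-link", the helper names, and the
rule "same place means same hostname" are all assumptions made for
illustration. A real implementation might compare more than the
hostname.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def _host(url):
    """Return the lowercased hostname of a URL, or '' if there is none."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # treat bare "www.example.com/x" as a URL
    return (urlsplit(url).hostname or "").lower()


def _looks_like_url(text):
    """True when the link text is a single URL-like word and nothing else."""
    words = text.split()
    return len(words) == 1 and words[0].lower().startswith(
        ("http://", "https://", "www."))


class SpoofedLinkFinder(HTMLParser):
    """Collect one hypothetical token per link whose text is a URL
    naming a different host than the href (assumed spoof criterion)."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text)
            if _looks_like_url(text) and _host(text) != _host(self._href):
                self.tokens.append("url:spoofed-link")  # hypothetical name
            self._href = None


def spoof_tokens(html):
    """Return the list of spoofed-link tokens found in an HTML body."""
    finder = SpoofedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.tokens
```

Links whose text is ordinary wording ("click here") produce no token,
so only the URL-as-text case is flagged. If something like this were
emitted alongside the normal tokens, training on a few of these
messages should be enough for the classifier to learn that the token
is a strong spam clue.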
--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
The Astoria Seminar ==> http://www.astoriaseminar.com