[Spambayes] Tough to classify

Tim Peters tim.one at comcast.net
Sun Apr 13 03:02:50 EDT 2003


[David Shaw]
> I placed an order with Amazon today.  I got a TiVo and a Java book.
> The order confirmation came back unsure, with 123 clues pointing both
> ways, and probability as follows:
>
> *H* 0.981571864474
> *S*	0.56420331545
>
> This message is obviously ham to a human,

I have no doubt that it was obviously ham to you, but I don't accept that it
would have been obvious ham to humans other than you.  For example,

"""
 <<Confirmation of Order.html>>
Thanks for ordering from Gateway!  Please see the attached file for the
details of your order.  Should you wish to add additional items, or have any
questions, please reply to me or call me at the number listed inside.
Refer a friend!  You may qualify for a $50 credit when you refer a friend
and they buy a Gateway PC.  See this page for details
http://www.gateway.com/programs/rewards/index.shtml
"""

is the text of an order confirmation I got from Gateway last month.  I dare
say it's spam to everyone else on this list <wink>.

> but here are some of the higher spam clues:
>
> find,	0.908163265306
> day?	0.908163265306
> $5,000.	0.908163265306
> 20,	0.908163265306
> url:help	0.908163265306
> telephone:	0.934782608696
> order:	0.934782608696
> buy	0.942237128563
> saver	0.949438202247
> seller	0.96511627907
> online,	0.983271375465
> ordering	0.987106017192
> dollar	0.987106017192
> grand	0.988431876607
> shopping	0.992091388401
> tax	0.994699646643
> subject:with	0.99504950495
> subject:Your	0.997366881217
>
>
> What can be done in a case like this?

Training on it will be effective, over time.  As it says on

    http://spambayes.sourceforge.net/background.html

    For example, commercial HTML email from a company you do business with is
    quite likely to score as Unsure the first time the system sees such a
    message from a particular company.  Spam and commercial email both use the
    language and devices of advertising heavily, so it's hard to tell them
    apart.  Training quickly teaches the system all sorts of things about the
    commercial email you want, though, ranging from which company sent it and
    how they addressed you, to the kinds of products and services it's
    offering.

and, e.g., "$5,000." is either some advertising gimmick, or you paid waaaay
too much for a Java book <wink>.

> I don't order from amazon that often (maybe 4 times a year), but amazon
> itself is a ham clue:
>
> url:amazon	0.155172413793

You must have many ham clues, else your *H* score wouldn't have been 0.98.
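
To see why those two numbers still land in Unsure, here is a small worked
sketch.  It assumes the final score is formed as (S - H + 1) / 2 and that the
ham and spam cutoffs are 0.20 and 0.90.  Neither the formula nor the cutoffs
are stated in this message, so treat both as assumptions and check them
against classifier.py and your own options.

```python
# Indicator values quoted in the report above.
H = 0.981571864474   # evidence for ham
S = 0.56420331545    # evidence for spam

# Assumed combining rule: fold the two indicators into one 0..1 score.
score = (S - H + 1.0) / 2.0

# Assumed cutoffs; your configuration may differ.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90

if score < HAM_CUTOFF:
    verdict = "ham"
elif score < SPAM_CUTOFF:
    verdict = "unsure"
else:
    verdict = "spam"

print(round(score, 4), verdict)
```

Under those assumptions the strong *H* pulls the score well below 0.5, but the
middling *S* keeps it from dropping under the ham cutoff.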

> I feel like spambayes has enough clues to know this is ham, it's just a
> question of calculating the probability in such a way as to recognize
> it.  I would be interested in any thoughts on this.

There are many ways to combine the individual word spamprobs so that the msg
will come out as ham.  The trick is to do so in a way that doesn't also
classify more spam as ham.  The combination method in spambayes is the end
result of some intense work on the topic by several people, and beat about a
dozen other combination methods in large tests.  That doesn't mean it's the
best possible combination method, but does suggest it won't be trivial to
do better.
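
For anyone who wants something concrete to poke at, here is a minimal sketch
of a chi-squared style combining scheme.  It is written from memory as an
illustration, not copied from classifier.py, and the names chi2q and combine
are made up here.  The real code may differ in details such as how it guards
against floating-point underflow.

```python
import math


def chi2q(x2, v):
    """Return P(chi-squared >= x2) for v degrees of freedom (v must be even)."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair over 1.0.
    return min(total, 1.0)


def combine(probs):
    """Combine per-word spamprobs into (H, S, score).  Hypothetical helper."""
    n = len(probs)
    if n == 0:
        return 0.5, 0.5, 0.5
    # Sum logs instead of multiplying, so long messages don't underflow.
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_s, 2 * n)   # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_h, 2 * n)   # evidence for ham
    return h, s, (s - h + 1.0) / 2.0


# Strong clues in both directions: H and S both come out high, score near 0.5.
h, s, score = combine([0.99] * 5 + [0.01] * 5)
print(h, s, score)
```

The point of the example call is that a message with strong clues pointing
both ways gets a high H and a high S at once, and so a middling score, rather
than being pushed to one side by whichever clues happen to be more extreme.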

The combination code (in classifier.py) is about the easiest part of the
system to change, so feel encouraged to test alternatives.  "I feel like"
isn't really testable on its own <wink>.
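
A rough sketch of what "testable" could look like: run a candidate combiner
over messages you have already labeled, and count the mistakes in each
direction.  Everything here is hypothetical (the evaluate and mean_combiner
names, the cutoffs, and the toy data, which is invented purely to exercise the
code), and a real comparison needs a large corpus of your own mail.

```python
def evaluate(combine, labeled_msgs, ham_cutoff=0.20, spam_cutoff=0.90):
    """Count outcomes for a combiner over (is_spam, word_probs) pairs."""
    counts = {"correct": 0, "unsure": 0, "fp": 0, "fn": 0}
    for is_spam, probs in labeled_msgs:
        score = combine(probs)
        if ham_cutoff <= score < spam_cutoff:
            counts["unsure"] += 1
        elif (score >= spam_cutoff) == is_spam:
            counts["correct"] += 1
        elif is_spam:
            counts["fn"] += 1   # spam classified as ham
        else:
            counts["fp"] += 1   # ham classified as spam
    return counts


def mean_combiner(probs):
    """A deliberately naive alternative: the plain average of the spamprobs."""
    return sum(probs) / len(probs) if probs else 0.5


# Invented toy data: (is_spam, per-word spamprobs).
toy = [
    (False, [0.02, 0.05, 0.10]),
    (False, [0.16, 0.98, 0.99, 0.05]),
    (True, [0.99, 0.97, 0.95]),
    (True, [0.10, 0.15, 0.20]),
]

results = evaluate(mean_combiner, toy)
print(results)
```

A change that turns an Unsure ham into a ham only counts as an improvement if
the "fn" column doesn't grow along with it.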



