[Spambayes] progress on POP+VM+ZODB deployment

Tim Peters tim.one@comcast.net
Mon Oct 28 05:49:08 2002


---------------------- multipart/mixed attachment
[Derek Simkowiak]
> 	Thanks for your responses today, they've been very helpful.  I
> hold out hope, though, that one day the heuristics will be available to
> magically know what I want :)
>
> 	Seriously speaking, my gut says that all the information we need
> is in the spam collections like Bruce's.  Somebody just needs to figure
> out how to mine it, methinks.

I think you're missing a basic point, but not due to lack of repetition
<wink>.  Let's get real concrete.  Go to this site:

    http://www.esmokes.com/

I buy cigarettes from that site, and I get glitzy HTML promotional email
from them about once a week.  I want that email.  You don't (or so I guess),
and *any* filter trained on any spam collection on Earth is-- if it's worth
anything at all --going to say that's spam.  I'll attach one of their emails
for your perusal.

This isn't a question of classification technology so much as it's a
question of personal preference, and so long as you're determined that
everyone must use the same classifier, personal preference goes out the
window.  That's a bad use of technology, IMO -- I'm not interested in
treating everyone like interchangeable cogs.  Buy a server with enough disk
space so everyone can have their own classifier, and do whatever else it
takes to give people a system they'll truly love instead of merely endure.

The spambayes system had no trouble learning that *I* want this crap because
it found many lexical clues nearly unique to email from this particular
vendor:

'esmokes.com'                  0.0412844
'subject:eSmokes.com'          0.0412844
'url:esmokes'                  0.0412844
'carton'                       0.0505618
'esmokes.com!'                 0.0505618
'esmokes.com,'                 0.0505618
'from:email addr:esmokes.com'  0.0505618
'message-id:@mail-server'      0.0505618
'url:brandid'                  0.0505618
'url:side'                     0.0505618
'url:template1'                0.0505618
'url:vadcamp'                  0.0505618
'carton!'                      0.0652174

and there are about 25 other low-spamprob words (but with higher spamprobs
than those) common in this vendor's (but no other's) email too.  The
detection of other kinds of spam wasn't injured at all.  Even so, email from
this place still scores between 0.03 and 0.17 for me (which are high scores
for ham under chi-combining, but well within my "I'm sure it's ham" range).
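
To make those numbers less mysterious, here is a minimal sketch (not the
actual spambayes source) of how a token spamprob and a chi-combined score
can be computed.  It assumes Robinson's adjustment with strength s = 0.45
and prior x = 0.5, which this message doesn't state; the function names and
the corpus sizes of 100 are made up for illustration.  Under those
assumptions, a token seen in 5, 4, or 3 ham messages and in no spam
reproduces the three distinct spamprobs listed above.

```python
import math

# Assumed settings (not given in this message): Robinson's "unknown word"
# strength s and prior probability x.
S = 0.45
X = 0.5


def token_spamprob(ham_count, spam_count, nham, nspam, s=S, x=X):
    """Robinson-adjusted spam probability for a single token."""
    n = ham_count + spam_count          # number of messages with this token
    if n == 0:
        return x                        # never seen: fall back to the prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    p = spam_ratio / (ham_ratio + spam_ratio)   # raw estimate
    return (s * x + n * p) / (s + n)    # shrink toward x when n is small


def chi2q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine token spamprobs into one score in [0, 1]; 0 means ham."""
    n = len(probs)
    if n == 0:
        return 0.5
    spam_ev = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_ev = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_ev - ham_ev + 1.0) / 2.0


# Hypothetical corpus sizes; with spam_count == 0 they don't affect the result.
seen_in_5_ham = token_spamprob(5, 0, 100, 100)
seen_in_4_ham = token_spamprob(4, 0, 100, 100)
seen_in_3_ham = token_spamprob(3, 0, 100, 100)

# The 13 spamprobs from the table above.
listed = [0.0412844] * 3 + [0.0505618] * 9 + [0.0652174]
score = chi_combine(listed)
print(seen_in_5_ham, seen_in_4_ham, seen_in_3_ham, score)
```

Combining only the 13 listed clues gives a score very near 0.  Real email
from this vendor scores a bit higher, presumably because the classifier
also sees the spammier tokens in the same messages.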

>>  I'd like to plead for world peace too, while the algorithm geniuses
>> are at it <wink>.

> 	Pfft, that one's easy.  It's the implementation that kills ya! :)

In my case, it will be the cigarettes <wink>.

---------------------- multipart/mixed attachment--