[Spambayes] RE: solution for the "spam of the future"?

Tue Dec 16 17:16:28 EST 2003

Don't start generating the "missing: N" token until the training database is large enough for it to make sense.

Bob

> -----Original Message-----
> From: Skip Montanaro [mailto:skip at pobox.com]
> Sent: Tuesday, December 16, 2003 4:50 PM
> To: Tiago Estill de Noronha
> Cc: 'SpamBayes'
> Subject: Re: [Spambayes] solution for the "spam of the future"?
> 
> 
> 
>     Tiago> Create a "meta token" that is used every time a word not in
>     Tiago> the database is found in the email.  Do the Bayesian thing when
>     Tiago> the user sends an email containing a new word to spam or ham.
>     Tiago> From then on, every time a user gets an email with new words,
>     Tiago> spambayes would classify it as ham or spam.  After a while
>     Tiago> receiving those random-chars emails (and building the database
>     Tiago> of known words, the token database itself), the points for the
>     Tiago> new-word "meta token" would increase toward the spam side.
> 
> Let's modify your proposal slightly.  Suppose we add a "missing: N" clue,
> where N is the number of tokens found in the message but not in the training
> database.  Otherwise, I suspect almost all mails will generate a "missing:"
> token.  (No token is generated more than once per message.)
> 
> There's a problem with either formulation.  Start with an empty training
> database.  Add one spam.  All N tokens it contains will be missing from the
> database, yielding a "missing: N" token (or maybe a "missing: log(N)" token).
> Add another message, make this one ham.  It won't overlap 100% with the spam
> you just added, so it will generate a "missing: M" token.  And so on.  Early
> on, it seems your database will be polluted with a rather large number of
> missing: tokens for both ham and spam.  I think it might be difficult to
> overcome these initial training "mistakes" to turn it into a potentially
> useful clue.
> 
> Skip
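
To make the two ideas in this thread concrete, here is a minimal sketch of a log-bucketed "missing:" clue that is suppressed until the training database reaches a minimum size. This is not SpamBayes code: the function names, the "missing: 2**k" token format, the choice to emit nothing when no tokens are missing, and the min_db_size threshold are all assumptions made for illustration.

```python
import math


def missing_token(tokens, known, min_db_size=1000):
    """Return one "missing:" clue for a message, or None.

    tokens      -- tokens extracted from a single message
    known       -- set of tokens already in the training database
    min_db_size -- hypothetical threshold: no clue is generated until
                   the database holds at least this many distinct tokens
    """
    if len(known) < min_db_size:
        # Database too small for "missing" to mean anything yet.
        return None
    # Count each unseen token once per message.
    n_missing = sum(1 for tok in set(tokens) if tok not in known)
    if n_missing == 0:
        return None
    # Bucket by floor(log2(N)) so nearby counts share one clue.
    return "missing: 2**%d" % int(math.log2(n_missing))


def train_sequence(messages, min_db_size):
    """Train from an empty database, recording the clue each message
    would have produced at the moment it was added."""
    known = set()
    clues = []
    for tokens in messages:
        clues.append(missing_token(tokens, known, min_db_size))
        known.update(tokens)
    return clues


messages = [
    ["buy", "now", "cheap"],       # first spam
    ["lunch", "now", "tomorrow"],  # first ham
    ["xqzv", "buy", "now"],        # spam with one random word
]

# With no threshold, every early message gets a "missing:" clue,
# which is the pollution Skip describes.
unguarded = train_sequence(messages, min_db_size=0)

# With a (tiny, illustrative) threshold, the early clues are skipped.
guarded = train_sequence(messages, min_db_size=5)
```

With the threshold at 0, the first spam and the first ham both produce a clue simply because the database is nearly empty. Raising the threshold defers the clue until unseen tokens are actually unusual. A realistic value for min_db_size would have to be found by testing.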