[Spambayes] RE: solution for the "spam of the future"?
rcoe at CambridgeMA.GOV
Tue Dec 16 17:16:28 EST 2003
Don't start generating the "Missing: N" token until the database is large enough for it to make sense.
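
A rough sketch of that gating idea, for illustration only. The function name, its parameters, and the threshold of 100 trained messages are all made up here; none of this is SpamBayes code, and the right threshold would have to come from testing.

```python
def missing_clue(tokens, known_tokens, n_trained, min_trained=100):
    """Return a 'missing: N' pseudo-token, or None while the database is small.

    tokens       -- tokens extracted from the incoming message
    known_tokens -- set of tokens already present in the training database
    n_trained    -- number of messages trained so far
    min_trained  -- hypothetical cutoff below which no clue is generated
    """
    if n_trained < min_trained:
        # Database too small: nearly everything would count as missing.
        return None
    # Count each unseen token once, however often it repeats in the message.
    n_missing = len(set(tokens) - set(known_tokens))
    return "missing: %d" % n_missing


print(missing_clue(["a", "b", "c"], {"a"}, n_trained=500))  # missing: 2
print(missing_clue(["a", "b", "c"], {"a"}, n_trained=10))   # None
```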
> -----Original Message-----
> From: Skip Montanaro [mailto:skip at pobox.com]
> Sent: Tuesday, December 16, 2003 4:50 PM
> To: Tiago Estill de Noronha
> Cc: 'SpamBayes'
> Subject: Re: [Spambayes] solution for the "spam of the future"?
> Tiago> Create a "meta token" that is used every time a word not in the
> Tiago> database is found in an email. Do the Bayesian thing when the
> Tiago> user trains the email containing the new word as spam or ham. From
> Tiago> then on, every time a user gets an email with new words, SpamBayes
> Tiago> would classify it as ham or spam. After a while receiving those
> Tiago> random-character emails (and building the database of known words,
> Tiago> the token database itself), the score for the new-word "meta token"
> Tiago> would shift toward the spam side.
> Let's modify your proposal slightly. Suppose we add a "missing: N" clue,
> where N is the number of tokens found in the message but not in the training
> database. Otherwise, I suspect almost all mails will generate a "missing:"
> token. (No token is generated more than once per message.)
> There's a problem with either formulation. Start with an empty training
> database. Add one spam. All N tokens it contains will be missing from the
> database, yielding a "missing: N" token (or maybe a "missing: log(N)" token).
> Add another message, make this one ham. It won't overlap 100% with the spam
> you just added, so it will generate a "missing: M" token. And so on. Early
> on, it seems your database will be polluted with a rather large number of
> missing: tokens for both ham and spam. I think it might be difficult to
> overcome these initial training "mistakes" to turn it into a potentially
> useful clue.
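
The cold-start problem described in the quoted message can be seen with a toy run. This is only a sketch under simple assumptions: whitespace-split tokens, a plain set standing in for the training database, and invented sample messages. It just records which "missing: N" clue each message would generate at the moment it is trained.

```python
def train_sequence(messages):
    """Train on (tokens, is_spam) pairs, starting from an empty database.

    Returns one (clue, label) pair per message, where clue is the
    'missing: N' token that message would have generated when it arrived.
    """
    known = set()  # stand-in for the token database
    clues = []
    for tokens, is_spam in messages:
        unique = set(tokens)
        n_missing = len(unique - known)
        clues.append(("missing: %d" % n_missing, "spam" if is_spam else "ham"))
        known |= unique  # the message's tokens are now in the database
    return clues


history = train_sequence([
    ("buy cheap pills now".split(), True),
    ("meeting notes for monday".split(), False),
    ("cheap pills for sale now".split(), True),
])
for clue, label in history:
    print(clue, label)
# missing: 4 spam
# missing: 4 ham
# missing: 1 spam
```

The first spam and the first ham both produce "missing: 4", so early on the same clue is trained on both sides and carries no information.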