[Spambayes] solution for the "spam of the future"?

Skip Montanaro skip at pobox.com
Tue Dec 16 16:49:45 EST 2003


    Tiago> Create a "meta token" that will be used every time a word not in
    Tiago> the database is found in the email.  Do the Bayesian thing when the
    Tiago> user sends the email containing a new word to spam or ham.  From
    Tiago> that, every time a user gets an email with new words, spambayes
    Tiago> would classify it as ham or spam.  After a while receiving those
    Tiago> random-character emails (and building the database of known words,
    Tiago> the token database itself), the points for the new-word "meta
    Tiago> token" would increase to the spam side.

Let's modify your proposal slightly.  Suppose we add a "missing: N" clue,
where N is the number of tokens found in the message but not in the training
database.  Without the count, I suspect almost all mails would generate the
same bare "missing:" token, and it would carry no information about how many
words were new.  (No token is generated more than once per message.)
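
To make the "missing: N" idea concrete, here is a minimal sketch in Python.
It is not Spambayes' actual tokenizer code; the function name missing_clue
and the use of a plain set for the training database are assumptions made
only for illustration.

```python
def missing_clue(message_tokens, known_tokens):
    """Return one 'missing: N' clue for a message.

    N counts the distinct tokens in the message that are absent from
    the training database. Both arguments are hypothetical stand-ins:
    any iterable of strings for the message, a set for the database.
    """
    unique = set(message_tokens)            # each token counts once per message
    n_missing = len(unique - set(known_tokens))
    return "missing: %d" % n_missing


# Two known words plus one random string that appears twice.
clue = missing_clue(["buy", "now", "xqzt", "xqzt"], {"buy", "now"})
print(clue)
```

The clue would then be handed to the classifier alongside the message's
ordinary tokens, so it gets trained and scored like any other token.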

There's a problem with either formulation.  Start with an empty training
database.  Add one spam.  All N tokens it contains will be missing from the
database, yielding a "missing: N" token (or maybe a "missing: log(N)" token).
Add another message, make this one ham.  It won't overlap 100% with the spam
you just added, so it will generate a "missing: M" token.  And so on.  Early
on, it seems your database will be polluted with a rather large number of
"missing:" tokens for both ham and spam.  I think it might be difficult to
overcome these initial training "mistakes" to turn it into a potentially
useful clue.
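
The cold-start effect can be seen in a toy simulation.  This is only a
sketch under assumed names (train_with_missing_clue, and a dict mapping each
token to [spam_count, ham_count]); it does not reflect how Spambayes stores
its counts.

```python
def train_with_missing_clue(messages):
    """Train on (is_spam, tokens) pairs, starting from an empty database.

    Returns a dict mapping token -> [spam_count, ham_count], including
    the synthetic 'missing: N' clue generated for each message.
    """
    known = set()    # ordinary tokens seen in training so far
    counts = {}      # token -> [spam_count, ham_count]
    for is_spam, tokens in messages:
        unique = set(tokens)
        # Compute the clue against the database as it stood BEFORE this message.
        clue = "missing: %d" % len(unique - known)
        for tok in unique | {clue}:
            pair = counts.setdefault(tok, [0, 0])
            pair[0 if is_spam else 1] += 1
        known |= unique
    return counts


messages = [
    (True,  ["cheap", "pills", "click", "here"]),   # spam, everything is new
    (False, ["meeting", "at", "noon", "here"]),     # ham, only "here" is known
    (True,  ["cheap", "pills", "xqzvw"]),           # spam with one random word
]
counts = train_with_missing_clue(messages)
for tok in sorted(t for t in counts if t.startswith("missing: ")):
    print(tok, counts[tok])
```

After just three messages the database already holds three different
"missing:" clues, one of them trained as ham, and none of them says much
about random-character spam.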

Skip


