[spambayes-dev] RE: [Spambayes] Watch out for digests...

Wed Dec 10 21:22:59 EST 2003

    >> Big mistake. Stuff started getting wacky real fast.... Guess what?
    >> One of the messages in the digest was an obvious spam.

    Tony> This is perhaps a drawback of the minimalist database size
    Tony> training strategy.  I'm guessing that if you had a larger
    Tony> database, the effect wouldn't have been as pronounced?  

Maybe.  At the moment, I have 9768 tokens in my database and 7731 of them
are hapaxes.  As you suggest, mistakes do appear to throw things off more
dramatically with a small database, but they are also easier to detect.

I'd be interested to see what others' hapax fractions are:

    >>> import shelve
    >>> db = shelve.open(".hammiedb")
    >>> # A hapax is a token seen exactly once: counts of (0,1) or (1,0).
    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])
    7731
    >>> len(db)
    9769
    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])/float(len(db)-1)
    0.79146191646191644

(The -1 is to eliminate the 'saved state' token.  I'm just being
pedantic. ;-)
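
For anyone who wants to compare numbers, here is a rough sketch of the same
calculation as a function.  The names hapax_fraction and skip_keys, the
"saved state" key spelling, and the toy data are made up for illustration.
It assumes every remaining value is a pair of counts, as in the session
above.

```python
def hapax_fraction(counts, skip_keys=()):
    """Return the fraction of tokens whose counts are (0, 1) or (1, 0).

    counts is any mapping of token -> count pair (a dict or an open shelf).
    skip_keys lists bookkeeping keys that are not real tokens (assumed
    here to be something like a 'saved state' entry).
    """
    total = 0
    hapaxes = 0
    for token in counts:
        if token in skip_keys:
            continue
        total += 1
        # Same test as the interactive session: seen exactly once.
        if counts[token] in ((0, 1), (1, 0)):
            hapaxes += 1
    return hapaxes / total if total else 0.0


if __name__ == "__main__":
    # Toy data, not taken from any real database.
    toy = {
        "url:example": (0, 1),
        "highlight": (1, 0),
        "dot": (5, 2),
        "saved state": (3, 8, 1),
    }
    print(hapax_fraction(toy, skip_keys=("saved state",)))
```

With a real database, the mapping passed in would be the object returned by
shelve.open(), as in the session above.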

Another interesting thing (I think) might be to investigate the importance
of synthetic tokens (e.g., 'url:eweek' or 'received:168.10.156') vs. natural
tokens (e.g., 'highlight' or 'dot') for smaller vs. larger databases.  I
think one reason training on a single unsure has a dramatic effect on a
bunch of other unsure spams is all the synthetic tokens they have in common
due to similar delivery mechanisms (gotta use that account before it gets
shut down...).  If a spammer spews a bunch of messages from ISP A, then gets
booted, his next spew will be from somewhere else.  I suspect many of the
ISP-related synthetic tokens generated will only ever be hapaxes, and thus
be much more important with a small database than with a large one.

It's just a theory.  Hey, maybe that's another master's thesis idea for
Brett Cannon... ;-)
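
A first step toward checking the synthetic-token theory could be to compare
hapax fractions for the two kinds of token.  This is only a sketch: it
assumes a token is synthetic whenever it contains a colon (as in 'url:eweek'),
which is a guess at a workable rule and not a definition.  The function name
and the toy counts are invented.

```python
def hapax_fraction_by_kind(counts):
    """Return hapax fractions for synthetic and natural tokens separately.

    Assumption: a token containing ':' (e.g. 'url:eweek') is synthetic,
    and anything else is natural.  Values are assumed to be count pairs.
    """
    totals = {"synthetic": 0, "natural": 0}
    hapaxes = {"synthetic": 0, "natural": 0}
    for token in counts:
        kind = "synthetic" if ":" in token else "natural"
        totals[kind] += 1
        # A hapax has been seen exactly once in training.
        if counts[token] in ((0, 1), (1, 0)):
            hapaxes[kind] += 1
    return {
        kind: (hapaxes[kind] / totals[kind] if totals[kind] else 0.0)
        for kind in totals
    }


if __name__ == "__main__":
    # Toy counts for illustration only.
    toy = {
        "url:eweek": (0, 1),
        "received:168.10.156": (1, 0),
        "highlight": (4, 2),
        "dot": (0, 1),
    }
    print(hapax_fraction_by_kind(toy))
```

Running this against a small and a large database would show whether
synthetic tokens really are more hapax-heavy in the small one.  Any
bookkeeping key such as the 'saved state' entry would need to be removed
first.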

Skip