[spambayes-dev] "approximately" the same size

Skip Montanaro skip at pobox.com
Fri Jan 21 21:12:13 CET 2005


When we tell people not to let their ham/spam imbalance get too bad, we are
referring to the number of messages trained.  There is another way to look
at this imbalance, though: the number of tokens generated from each stream.
For me, ham messages are much larger on average than spam messages.
Consequently, for roughly the same number of tokens to come from each
stream, I need more spams than hams.  Is there some way to tell how this
might affect scoring?  Is it even relevant to scoring?
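
To make the question concrete, here is a minimal sketch of a Robinson-style
per-token spam probability, assuming each token's ham and spam counts are
normalized by the number of messages trained in each stream.  The function
name and the defaults s=0.45 and x=0.5 are assumptions for illustration, not
copied from the SpamBayes source.  The message counts are the 93 hams and 267
spams of the training set described below.

```python
def token_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Illustrative Robinson-style spam probability for one token.

    spamcount/hamcount: number of trained spams/hams containing the token.
    nspam/nham: number of trained spams/hams.
    s, x: assumed prior strength and prior probability for unknown tokens.
    """
    n = spamcount + hamcount
    if n == 0:
        return x  # never-seen token: fall back to the prior
    # Counts are normalized by messages trained per stream, not by tokens.
    spamratio = spamcount / float(nspam)
    hamratio = hamcount / float(nham)
    prob = spamratio / (hamratio + spamratio)
    # Pull the estimate toward x; the pull weakens as the evidence n grows.
    return (s * x + n * prob) / (s + n)


# A token seen in 10 hams and 10 spams.
even = token_spamprob(spamcount=10, hamcount=10, nspam=267, nham=93)
# A token seen in one third of each stream (31 of 93, 89 of 267).
proportional = token_spamprob(spamcount=89, hamcount=31, nspam=267, nham=93)
print("%.3f %.3f" % (even, proportional))
```

Under these assumptions, the token seen in 10 messages of each stream scores
below 0.5, because it appears in a larger fraction of the hams, while the
token seen in one third of each stream scores 0.5.  If the sketch is right,
per-stream token totals never enter the per-token estimate directly; whether
they matter indirectly is still the open question.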

At the moment, I have nearly three times as many spams as hams in my training set:

    % egrep '^From ' newham.old | wc -l
          93 
    % egrep '^From ' newspam.old | wc -l
         267 

but the hams contribute approximately the same number of unique tokens as
the spams:

    >>> from spambayes import mboxutils, tokenizer
    >>> hs = set()           
    >>> ss = set()
    >>> for msg in mboxutils.getmbox("newham.old"):
    ...    hs |= set(tokenizer.tokenize(msg))
    ... 
    >>> for msg in mboxutils.getmbox("newspam.old"):
    ...    ss |= set(tokenizer.tokenize(msg))
    ... 
    >>> len(hs)
    20360
    >>> len(ss)
    24734

Most tokens are unique to one set or the other:

    >>> len(ss & hs)
    5205
    >>> len(ss - hs)
    19529
    >>> len(hs - ss)
    15155
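
The figures above can be normalized by message count and by vocabulary size.
This is only arithmetic on the numbers already shown; the variable names are
made up for illustration.  Note that the per-message figure is the stream's
distinct-token count divided by its message count, which is not the same
thing as average message size, since messages in a stream share tokens.

```python
# Figures copied from the sessions above.
ham_msgs, spam_msgs = 93, 267
ham_tokens, spam_tokens = 20360, 24734   # len(hs), len(ss)
shared = 5205                            # len(ss & hs)

# Consistency check: shared plus stream-only tokens gives each set size.
assert shared + 15155 == ham_tokens      # len(hs - ss)
assert shared + 19529 == spam_tokens     # len(ss - hs)

# Distinct tokens contributed per trained message in each stream.
ham_per_msg = ham_tokens / float(ham_msgs)
spam_per_msg = spam_tokens / float(spam_msgs)

# Share of each stream's vocabulary that also occurs in the other stream.
ham_shared_frac = shared / float(ham_tokens)
spam_shared_frac = shared / float(spam_tokens)

print("distinct tokens per ham:   %.1f" % ham_per_msg)
print("distinct tokens per spam:  %.1f" % spam_per_msg)
print("ham vocabulary shared:     %.1f%%" % (100 * ham_shared_frac))
print("spam vocabulary shared:    %.1f%%" % (100 * spam_shared_frac))
```

That works out to roughly 219 distinct tokens per ham against roughly 93 per
spam, with about 26% of the ham vocabulary and about 21% of the spam
vocabulary shared between the two streams.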

Skip

