[spambayes-dev] Re: [Spambayes] Database cleaning?

Matthew Dixon Cowles matt at mondoinfo.com
Sat May 31 22:19:41 EDT 2003


[Alex Popiel on nonsense words in spam]
> Yes, those words cause database pollution, and yes, they can be
> weeded out with just a handful of lines of code... but it's hard to
> tell which hapax legomena will be useless, and which will soon get
> reinforced by other occurrences, so it's (IMNSHO) generally not
> worth the hassle.

With an eye toward reducing the size of the database, I instrumented
the classifier a while ago and found strong evidence that Alex is
right: hapaxes often figured in scoring. I didn't bother to calculate
exact numbers because the results were clear enough to persuade me
that removing hapaxes wasn't a useful strategy.

I tore that code out and instead hacked the classifier so that I
could determine how soon a word that figures in scoring is used in
scoring again. I think that the results are at least slightly
interesting. In the table, "Days prev" is the number of days since
the token last figured in scoring. Note that the histogram below is
log scaled.
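
A minimal sketch of this kind of bookkeeping, for anyone who wants to
repeat the measurement on their own mail. It is not the instrumentation
that produced the numbers in this message. The ReuseTracker class and
its method names are hypothetical and not part of spambayes, and
counting each token at most once per message is an assumption.

```python
from collections import Counter
from datetime import date


class ReuseTracker:
    """Hypothetical helper: tracks how soon scoring tokens are reused."""

    def __init__(self):
        self.last_used = {}    # token -> date it last figured in scoring
        self.gaps = Counter()  # days since previous use -> occurrences
        self.uses = Counter()  # token -> number of messages it scored in

    def record(self, tokens, day):
        """Record the tokens that figured in scoring one message."""
        # Assumption: a token counts once per message, hence set().
        for tok in set(tokens):
            prev = self.last_used.get(tok)
            if prev is not None:
                self.gaps[(day - prev).days] += 1
            self.last_used[tok] = day
            self.uses[tok] += 1

    def unique_tokens(self):
        """Number of distinct tokens ever used for scoring."""
        return len(self.uses)

    def used_once(self):
        """Number of tokens that figured in scoring exactly once."""
        return sum(1 for n in self.uses.values() if n == 1)


tracker = ReuseTracker()
tracker.record(["viagra", "python"], date(2003, 5, 1))
tracker.record(["viagra"], date(2003, 5, 1))
tracker.record(["viagra", "mortgage"], date(2003, 5, 4))
```

In a real run, record() would be called from wherever the classifier
picks the tokens that contribute to a message's score, with the
message's arrival date.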


Unique tokens used for scoring  60627
Used Once                       17388

Days prev  Count Histogram (log scaled)
        0 903644 **************************************************
        1  27694 *************************************
        2  15121 ***********************************
        3   7024 ********************************
        4   4694 *******************************
        5   3634 ******************************
        6   3134 *****************************
        7   2443 ****************************
        8   1697 ***************************
        9   1340 **************************
       10    982 *************************
       11    801 ************************
       12    671 ************************
       13    871 *************************
       14    630 ************************
       15    494 ***********************
       16    374 **********************
       17    343 *********************
       18    227 ********************
       19    216 ********************
       20    199 *******************
       21    226 ********************
       22    126 ******************
       23    114 *****************
       24     55 ***************
       25     22 ***********
       26     49 **************
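
The bar lengths above look consistent with scaling each bar by the
ratio of the logarithm of its count to the logarithm of the largest
count, with a full width of 50 characters. That is an inference from
the table, not a description of the code that drew it. A minimal
sketch under that assumption (log_bar is a hypothetical name):

```python
import math


def log_bar(count, max_count, width=50):
    """Return a bar of '*' whose length is log scaled against max_count."""
    # Counts of 0 or 1 have no positive logarithm, so they get no bar.
    if count <= 1 or max_count <= 1:
        return ""
    return "*" * round(width * math.log(count) / math.log(max_count))


# A few (days, count) rows taken from the table above.
rows = [(0, 903644), (1, 27694), (3, 7024), (25, 22)]
largest = max(count for _, count in rows)
for days, count in rows:
    print(f"{days:9d} {count:6d} {log_bar(count, largest)}")
```

On a linear scale the day-0 row would dwarf everything else; the log
scale is what makes the tail of the distribution visible at all.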


My mail may be unrepresentative in ways that exaggerate the slope
here. Specifically, I read the postmaster, webmaster, etc. addresses
for several domains, so it's common for me to get multiple copies of
the same spam.

Regards,
Matt



