[Spambayes] sharing wordlists - better numbers

Skip Montanaro skip at pobox.com
Tue May 27 10:36:31 EDT 2003


    Brad> Are you suggesting including a "starter database" in the spambayes
    Brad> distribution, by noting which words are common to more than one
    Brad> person?

I thought that was the direction you were headed with this exercise.
I guess I misunderstood.

    Brad> I wonder .. if we only used words which were "common" when
    Brad> determining spamminess, how well would that work?

Should work pretty well if we include "common" words which turn out to be
strong spam or ham indicators for a suitable cross-section of the group.
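
To make "strong indicators for a suitable cross-section of the group" a bit
more concrete, here is a rough sketch of one way to pick such words.  It is
an illustration only: the function names, the (nspam, nham, counts) layout,
the thresholds and the crude probability estimate are all assumptions made
for this example, not what Spambayes actually computes.

```python
from collections import defaultdict


def spam_prob(spam_count, ham_count, nspam, nham):
    """Crude per-word spam probability from raw counts.

    Assumption for illustration: this is NOT the Spambayes formula.
    """
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    total = spam_ratio + ham_ratio
    return spam_ratio / total if total else 0.5


def shared_indicators(user_dbs, min_users=2, strength=0.4):
    """Find words that are strong clues, in the same direction, for several users.

    user_dbs: list of (nspam, nham, {word: (spam_count, ham_count)}),
    a hypothetical in-memory layout with one entry per person.
    """
    votes = defaultdict(lambda: [0, 0])  # word -> [spam votes, ham votes]
    for nspam, nham, counts in user_dbs:
        for word, (spam_count, ham_count) in counts.items():
            p = spam_prob(spam_count, ham_count, nspam, nham)
            if p >= 0.5 + strength:
                votes[word][0] += 1
            elif p <= 0.5 - strength:
                votes[word][1] += 1

    shared = {}
    for word, (spam_votes, ham_votes) in votes.items():
        # Keep a word only if enough users agree and nobody disagrees.
        if spam_votes >= min_users and ham_votes == 0:
            shared[word] = "spam"
        elif ham_votes >= min_users and spam_votes == 0:
            shared[word] = "ham"
    return shared
```

A word that only one person has seen (a hapax, for instance) never reaches
min_users, so it stays out of the shared list no matter how strong a clue it
is for that one person.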

    Brad> Let's suppose in a "semi-shared database" mode, there was a
    Brad> mechanism for 'upscaling' hapaxes into the "common word list", so
    Brad> that long term the collective wordlist would continue to evolve.

Alex and I both have fairly large word databases.  I suspect hapaxes will
remain hapaxes.  I'm thinking of just a starter database of a reasonable
size.  It could be shipped in plain text form then installed using Tim
Stone's (I believe) database importer/exporter tool.
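
As a sketch of what "plain text form" could look like, something along these
lines would do.  The one-word-per-line "word,spam_count,ham_count" layout and
the function names are made up for this example; I haven't checked them
against the format the importer/exporter tool actually uses.

```python
import csv
import io


def export_words(counts, stream):
    """Write {word: (spam_count, ham_count)} as one CSV line per word."""
    writer = csv.writer(stream, lineterminator="\n")
    for word in sorted(counts):
        spam_count, ham_count = counts[word]
        writer.writerow([word, spam_count, ham_count])


def import_words(stream, counts=None):
    """Read the CSV lines back, adding to any counts the user already has."""
    counts = {} if counts is None else counts
    for word, spam_count, ham_count in csv.reader(stream):
        old_spam, old_ham = counts.get(word, (0, 0))
        counts[word] = (old_spam + int(spam_count), old_ham + int(ham_count))
    return counts


if __name__ == "__main__":
    # Hypothetical starter data, round-tripped through an in-memory "file".
    buf = io.StringIO()
    export_words({"viagra": (130, 1), "python": (1, 70)}, buf)
    buf.seek(0)
    print(import_words(buf))
```

Adding the starter counts to the user's existing counts, rather than
replacing them, means the starter database just gives a new user a head
start and their own training takes over as their counts grow.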

    Brad> Do Skip and Alex have a small std deviation in their virtual
    Brad> group? ;-)

I realize there's a smiley, but what do you mean by "virtual group"?

Skip


