[Spambayes] sharing wordlists - better numbers

Tue May 27 11:28:04 EDT 2003

On 27 May 2003 at 8:57, Skip Montanaro wrote:

> How about the next step in the exercise?  I propose that Alex and I
> (assuming Alex is amenable) each extract a non-hapax version of our word
> databases (real keys, real counts) then from that further extract a
> database from that of the most hammy and spammy words (how about <0.2 and
> >0.8?), run the usual tests against them and see how they do.  If that
> looks promising, we can merge the common words from the two, test again,
> see how big the result is, then decide whether to include it in a later
> distribution.
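
As a rough illustration of the steps quoted above (drop hapaxes, keep only strongly
hammy or spammy words, then intersect two people's lists), here is a minimal sketch.
It assumes each database is a plain dict mapping a word to a (spam_count, ham_count)
pair, and it uses a simple ratio-based probability. The function names and the data
layout are hypothetical, not the actual spambayes classifier or storage format.

```python
def non_hapax(db):
    """Drop words that were seen only once in total (hapaxes)."""
    return {w: (s, h) for w, (s, h) in db.items() if s + h > 1}


def spam_prob(spam_count, ham_count, nspam, nham):
    """Simple ratio-based estimate; NOT the real spambayes scoring."""
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    return spam_ratio / (spam_ratio + ham_ratio)


def strong_words(db, nspam, nham, low=0.2, high=0.8):
    """Keep non-hapax words whose estimate is below `low` or above `high`."""
    kept = {}
    for word, (s, h) in non_hapax(db).items():
        p = spam_prob(s, h, nspam, nham)
        if p < low or p > high:
            kept[word] = (s, h)
    return kept


def common_words(db_a, db_b):
    """Words present in both extracted databases."""
    return set(db_a) & set(db_b)


# Made-up counts, purely for illustration.
skip_db = {"viagra": (10, 0), "python": (0, 12), "the": (50, 50), "oneoff": (1, 0)}
alex_db = {"viagra": (7, 1), "meeting": (0, 9), "the": (40, 45)}

skip_strong = strong_words(skip_db, nspam=100, nham=100)
alex_strong = strong_words(alex_db, nspam=100, nham=100)
shared = common_words(skip_strong, alex_strong)
```

The size of `shared` relative to each person's own extract is one way to judge
whether a merged list is worth shipping.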

Are you suggesting including a "starter database" in the spambayes distribution, by 
noting which words are common to more than one person?

I wonder: if we only used words which were "common" when determining
spamminess, how well would that work?

Let's suppose that in a "semi-shared database" mode, there was a mechanism for
'upscaling' hapaxes into the "common word list", so that long term the collective
word list would continue to evolve.

Individuals keep only their personal weights for "common words", so the database is
split: the word list is shared, the weights are private.
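
One way to picture that split is sketched below. It assumes a word gets promoted
into the shared list once some number of distinct users have seen it; that rule,
the `promote_after` parameter, and both class names are hypothetical and only meant
to illustrate the idea, not anything that exists in spambayes.

```python
class SharedWordList:
    """The collective word list; holds words only, no weights."""

    def __init__(self, words=(), promote_after=2):
        self.words = set(words)
        self.promote_after = promote_after  # assumed promotion threshold
        self._nominations = {}              # word -> set of users who saw it

    def nominate(self, user, word):
        """Record that `user` saw `word`; return True if the word is shared."""
        if word in self.words:
            return True
        voters = self._nominations.setdefault(word, set())
        voters.add(user)
        if len(voters) >= self.promote_after:
            self.words.add(word)
            del self._nominations[word]
            return True
        return False


class PrivateWeights:
    """Per-user spam/ham counts, kept only for words in the shared list."""

    def __init__(self, user, shared):
        self.user = user
        self.shared = shared
        self.counts = {}  # word -> [spam_count, ham_count]

    def train(self, tokens, is_spam):
        for word in set(tokens):
            if self.shared.nominate(self.user, word):
                entry = self.counts.setdefault(word, [0, 0])
                entry[0 if is_spam else 1] += 1


shared = SharedWordList(["viagra"])
skip = PrivateWeights("skip", shared)
alex = PrivateWeights("alex", shared)

skip.train(["viagra", "mortgage"], is_spam=True)  # "mortgage" only nominated
alex.train(["mortgage"], is_spam=True)            # second user: promoted
```

Note that in this sketch the first user's sighting of a word is not counted until
the word has been promoted, which is one of the trade-offs such a scheme would have
to settle.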

Some users have gigantic lists of words. How many of those words are hapaxes?

Is it possible that we could get good results by only using "common words", even if 
non-hapaxes for some users were not in the common words list? 
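
Both questions could be measured directly. The sketch below assumes a personal
database is a dict mapping a word to a (spam_count, ham_count) pair; the helper
names and the sample counts are made up for illustration.

```python
def hapax_fraction(db):
    """Fraction of a user's words that were seen exactly once."""
    if not db:
        return 0.0
    hapaxes = sum(1 for s, h in db.values() if s + h == 1)
    return hapaxes / len(db)


def common_coverage(db, common):
    """Fraction of a user's non-hapax words that appear in the common list."""
    non_hapax = [w for w, (s, h) in db.items() if s + h > 1]
    if not non_hapax:
        return 0.0
    return sum(1 for w in non_hapax if w in common) / len(non_hapax)


# Made-up counts, purely for illustration.
user_db = {"a": (1, 0), "b": (0, 1), "c": (3, 2), "d": (0, 4)}
frac = hapax_fraction(user_db)
cover = common_coverage(user_db, common={"c"})
```

A low coverage figure would mean the user relies on many personal non-hapax words
that the common list would throw away.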

I suppose that depends on how close an individual's preferences are to the median of
their group's preferences.

Do Skip and Alex have a small std deviation in their virtual group? ;-)

-- 
Brad Clements,                bkc at murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
http://www.wecanstopspam.org/                   AOL-IM: BKClements