[Spambayes] sharing wordlists - better numbers

Brad Clements bkc at murkworks.com
Tue May 27 13:15:30 EDT 2003


On 27 May 2003 at 8:02, bill parducci wrote:

> this is a very interesting idea, but after working it through in my 
> head, it doesn't seem to offer an architectural improvement over the 
> existing system (even for 7000 users). this is not to say that the 
> pursuit of commonality, etc. won't bear fruit down the road, but that i
> personally don't think the original intent will be served.
> 
> just my two cents...

You expressed this sentiment last week, so I think you're up to 4 cents now. ;-)

My excuse continues to be: let's pass the first stage before worrying about the 
technical issues of deployment. We may never get that far anyway.

Another thought: in the case of 7000 users, how many are really going to bother to 
train? We know that a single person's weights probably don't speak for the whole 
community, but does an average of weights of a few members of the community 
represent the average of the weights of the entire community?

In other words, for those orgs who want some control over their spam, could the 
average weighting of 10 members out of 1000 reasonably represent the average of 
all 1000 members?

Heh, I know there's a technical name for this.. the mean of a sub-sample approaches 
the mean of the entire sample .. something like that.
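
The idea above is usually discussed as the sample mean being an estimator of 
the population mean, with an error that shrinks as the sample grows. Here is a 
rough simulation of the "10 out of 1000" question. It is only a sketch: the 
community weights are made-up random numbers (a beta distribution chosen 
arbitrarily), and sample_mean_error is a hypothetical helper, not anything in 
Spambayes.

```python
import random
import statistics


def sample_mean_error(population, sample_size, trials, seed=0):
    """Average absolute gap between a random sample's mean and the true mean."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    gaps = []
    for _ in range(trials):
        # Draw members without replacement, as if only they had trained.
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(gaps)


# ASSUMPTION: one word's weight for each of 1000 members, invented for
# illustration. Real per-user weights may be distributed very differently.
_rng = random.Random(42)
community = [_rng.betavariate(2, 5) for _ in range(1000)]

err_10 = sample_mean_error(community, 10, trials=500)
err_100 = sample_mean_error(community, 100, trials=500)
err_all = sample_mean_error(community, 1000, trials=5)

print(f"mean abs error, 10 trainers:   {err_10:.4f}")
print(f"mean abs error, 100 trainers:  {err_100:.4f}")
print(f"mean abs error, all 1000:      {err_all:.4f}")
```

Note this assumes the trainers are a random draw from the community. People who 
bother to train are self-selected, so their average could be biased in a way 
that no sample size fixes.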

So I'm thinking .. suppose you allow people to keep their private weights, but for 
those who just want "good enough" filtering, they use a "synthesized database" which 
represents the "average" of the private database weights.

Do you average the word weights across private databases before scoring, or do you 
average the scores?
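
The order matters whenever the scoring function is nonlinear in the word 
weights. A minimal sketch of both options follows. To keep it short it uses a 
simplified Graham-style product rule, which is an assumption for illustration 
and not the combining scheme Spambayes actually uses. The two users, their 
weights, and the function names are all hypothetical.

```python
from math import prod


def score(words, weights, unknown=0.5):
    """Simplified combining: prod(p) / (prod(p) + prod(1 - p))."""
    probs = [weights.get(w, unknown) for w in words]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def average_weights(databases, unknown=0.5):
    """Option 1: build one synthesized database of per-word mean weights."""
    vocab = set().union(*databases)
    return {
        w: sum(db.get(w, unknown) for db in databases) / len(databases)
        for w in vocab
    }


def average_scores(words, databases):
    """Option 2: score against each private database, then average."""
    return sum(score(words, db) for db in databases) / len(databases)


# Hypothetical private databases: word -> spam probability.
alice = {"viagra": 0.99, "python": 0.01, "meeting": 0.20}
bob = {"viagra": 0.90, "python": 0.40, "meeting": 0.60}
message = ["viagra", "python", "meeting"]

weights_first = score(message, average_weights([alice, bob]))
scores_first = average_scores(message, [alice, bob])

print(f"average weights, then score: {weights_first:.3f}")
print(f"score, then average:         {scores_first:.3f}")
```

With these made-up numbers the two approaches disagree noticeably: averaging 
the scores gives 0.55, while scoring against the averaged weights gives 
roughly 0.75. Averaging weights needs only one database lookup per word at 
scoring time; averaging scores needs every private database on hand.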

Just musing..


-- 
Brad Clements,                bkc at murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
http://www.wecanstopspam.org/                   AOL-IM: BKClements



