[Spambayes] Mixed case words in heading

Jan Fure list2003 at fure.net
Sun Apr 13 12:05:07 EDT 2003


Anthony Baxter wrote:
>>>>Tim Peters wrote
>>
>>It's *mostly* case-insensitive, and indeed to minimize database size, and
>>because tests both ways had overall indistinguishable error rates.
> 
> 
> With smaller training databases, case-sensitivity actually made for
> noticeably worse results. 
> 
> Anthony
> 

This made intuitive sense to this spambayes newbie as well, since 
case sensitivity would increase the number of distinct words and leave 
fewer training examples behind the statistics for each of them.

My original question was whether mixed case should be penalized:

Here is some possible pseudocode:
if ($word is unknown, i.e. does not occur in the database)
	if (1 < number of uppercase letters < total number of letters in $word)
		$spam_rating = 0.9
	end
end

This is outside the Bayesian approach, but would represent an educated 
guess, applied only to unknown words.
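
For concreteness, the pseudocode above could be sketched in Python roughly 
as follows. This is only an illustration of the idea, not spambayes code: 
the function name, the known_words set standing in for the database, and 
the choice to return None when the rule does not apply are all my own 
assumptions. The 0.9 value is the one from the pseudocode.

```python
def mixed_case_guess(word, known_words, mixed_case_prob=0.9):
    """Guess a spam probability for an unknown mixed-case word.

    Hypothetical helper: known_words is any container of words already
    in the training database. Returns None when the heuristic does not
    apply, so the caller can fall back to its normal handling.
    """
    # The heuristic only applies to words the database has never seen.
    if word in known_words:
        return None

    # Count letters only, so digits and punctuation do not affect the test.
    letters = [c for c in word if c.isalpha()]
    uppercase = sum(1 for c in letters if c.isupper())

    # More than one uppercase letter, but not every letter uppercase.
    if 1 < uppercase < len(letters):
        return mixed_case_prob
    return None
```

With this sketch, an unknown "ViAgRa" gets 0.9, while "Hello" (one capital) 
and "FREE" (all capitals) get no guess. Note that a legitimate name such as 
"McDonald" also matches the rule the first time it is seen.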

Jan Fure



