[Spambayes] Mixed case words in heading
list2003 at fure.net
Sun Apr 13 12:05:07 EDT 2003
Anthony Baxter wrote:
>>>>Tim Peters wrote
>>It's *mostly* case-insensitive, and indeed to minimize database size, and
>>because tests both ways had overall indistinguishable error rates.
> With smaller training databases, case-sensitivity actually made for
> noticeably worse results.
This made intuitive sense to this spambayes newbie too, since
case sensitivity would increase the number of distinct words and leave
fewer statistics for each of them.
My original question was whether mixed case should be penalized.
Here is some possible pseudocode:
if ($word is unknown, i.e. doesn't occur in the database)
    if (1 < # of uppercase letters < # of total letters in word)
        then $spam_rating = 0.9
This is outside the Bayesian approach, but would represent an educated
guess only for unknown words.
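
A minimal Python sketch of the rule above, for anyone who wants to try it.
Everything named here is hypothetical and not part of spambayes itself:
is_mixed_case, guess_spam_rating, and the known_words set standing in for
the training database. The 0.9 comes from the pseudocode; the 0.5 fallback
for other unknown words is my assumption of a neutral score.

```python
def is_mixed_case(word):
    """True if the word has at least two uppercase letters but is not all caps.

    Mirrors the pseudocode test: 1 < uppercase letters < total letters.
    Digits and punctuation are ignored when counting letters.
    """
    letters = sum(1 for ch in word if ch.isalpha())
    uppers = sum(1 for ch in word if ch.isupper())
    return 1 < uppers < letters


def guess_spam_rating(word, known_words, mixed_case_rating=0.9, neutral_rating=0.5):
    """Return a guessed spam rating for an unknown word.

    known_words is a hypothetical stand-in for the training database.
    Returns None for known words, so the normal statistics apply to them.
    """
    if word in known_words:
        return None  # known word: leave it to the trained statistics
    if is_mixed_case(word):
        return mixed_case_rating  # educated guess: odd capitalization looks spammy
    return neutral_rating  # assumed neutral score for any other unknown word
```

Note that requiring more than one uppercase letter means an ordinary
capitalized word such as "Hello" is not penalized, and an all-caps word
such as "HELLO" is not penalized either; only words like "ViAgRa" are.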