[Spambayes] Mixed case words in heading

Tim Peters tim.one at comcast.net
Sun Apr 13 02:17:48 EDT 2003


[bill parducci]
> ...
> as to the 'mixed case' issue, i believe that there have been a couple of
> different tests looking at case, etc., none of which returned
> statistical relevance. therefore, i *think* that the scoring is case
> insensitive currently (i would assume to optimize db size).

It's *mostly* case-insensitive, indeed to minimize database size, and also
because tests both ways had overall indistinguishable error rates.
Preserving or folding away case had different effects on different kinds of
msgs, though (there are comments about this in tokenize.py -- each way is
prone to different kinds of mistakes).

Case is preserved for words in Subject lines, and for header field names
("To:" vs "TO:", etc), because tests said both those improved overall
results.  Note that all test results in the early days were on English ham,
and mostly English spam.
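
A rough sketch of that split, for illustration only.  This is not the code
in tokenize.py; the function name sketch_tokens and the "header:" and
"subject:" token prefixes are made up here, and splitting on whitespace is a
simplifying assumption.  It only shows where case survives and where it is
folded away:

```python
import email


def sketch_tokens(raw_message):
    """Yield illustrative tokens from a raw RFC 2822 message string.

    Hypothetical sketch, not the SpamBayes tokenizer:
    - header field names keep the case the sender used ("To" vs "TO"),
    - Subject words keep their case,
    - body words are folded to lowercase.
    """
    msg = email.message_from_string(raw_message)

    # Header field names, exactly as written in the message.
    for name in msg.keys():
        yield "header:" + name

    # Subject words, case preserved.
    for word in (msg.get("Subject") or "").split():
        yield "subject:" + word

    # Body words, case folded.  Multipart messages are skipped to keep
    # the sketch short.
    payload = msg.get_payload()
    if isinstance(payload, str):
        for word in payload.split():
            yield word.lower()


raw = "TO: someone@example.com\nSubject: FREE Money\n\nBuy NOW\n"
tokens = list(sketch_tokens(raw))
```

With the sample message above, "TO" and "FREE" reach the classifier with
their case intact, while "NOW" in the body is seen only as "now".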



