I would love to see how the results would be affected by applying the scoring scheme to the entire content of the message, instead of just the 15 (or 16 in your case) most extreme samples.
Then it would be close to a classic Bayesian classifier, and like any such it would need entirely different scoring code to avoid catastrophic floating-point errors. Right now an intermediate result can't become smaller than 0.01**16 = 1e-32, so fp troubles are impossible; raise the exponent to a measly 200 and you're already out of the range of IEEE double precision. Classic classifiers work in logarithm space instead for this reason. You can read lots of papers on how those do; all the evidence suggests they do worse than this scheme on the spam-versus-non-spam task.
By the way, you never said why you increased that number by one;
It's explained in the comment block preceding the MAX_DISCRIMINATORS definition.
BTW, in an unreported experiment I boosted MAX_DISCRIMINATORS to 36. I don't recall now exactly what happened, but it was a disaster for at least one of the error rates.
did it make that much difference?
Not on average. It helped eliminate a narrow class of false positives, where previously the first 15 extremes the classifier saw had 8 probs of .99 and 7 of .01; that works out to "spam". Making the number of discriminators even instead allowed for graceful ties, and ties favor ham in this scheme. All previous decisions "should be" revisited after each new change, though, and in this particular case it could well be that stripping HTML tags out of plain-text messages also addressed the same narrow issue, but in a more effective way (without some special gimmick, virtually every message including so much as an example of HTML got scored as spam).
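A sketch of the arithmetic behind that false-positive class, again assuming the Graham-style P/(P+Q) combining (graham_combine is my name for it, not code from the project):

```python
def graham_combine(probs):
    # Graham-style combining: P/(P+Q), where P is the product of the
    # spamprobs and Q the product of their complements.
    p = q = 1.0
    for x in probs:
        p *= x
        q *= 1.0 - x
    return p / (p + q)

odd = [0.99] * 8 + [0.01] * 7   # 15 extremes: the unmatched .99 tips it
even = [0.99] * 8 + [0.01] * 8  # 16 extremes: perfectly balanced

graham_combine(odd)   # ~0.99 -> scored as spam
graham_combine(even)  # ~0.5 up to rounding -> a tie, which favors ham
```

With 15 extremes the lone unpaired .99 drives the combined score all the way to .99; with 16, the eight .99s and eight .01s cancel and the score sits at the 0.5 tie point.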