I must be missing something. I would *hope* that you review *all* messages claimed to be spam, in which case the number of msgs to be reviewed would, in a perfectly accurate system, be equal to the number of spams received.
Good lord, certainly not! Remember that Exim rejects a couple hundred messages a day that never get near SpamAssassin -- that's mostly Chinese/Korean junk that's rejected on the basis of 8-bit chars or banned charsets in the headers. Then, probably 50-75% of what SA gets its hands on scores >= 10.0, so it too is rejected at SMTP time. Only messages that score < 10 are accepted, and those that score >= 5.0 are set aside in /var/mail/spam for review. That's 10-30 messages/day.
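The triage described above can be sketched as a tiny routing function. This is only an illustration of the thresholds quoted in the message (>= 10.0 rejected at SMTP time, >= 5.0 set aside); the function name and return strings are invented, not anything Exim or SpamAssassin actually uses.

```python
def triage(sa_score: float) -> str:
    """Route a message based on its SpamAssassin score (illustrative only)."""
    if sa_score >= 10.0:
        return "reject-at-smtp"   # refused during SMTP, never accepted
    if sa_score >= 5.0:
        return "set-aside"        # lands in /var/mail/spam for manual review
    return "deliver"              # passed through to the recipient
```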
(I do occasionally scan Exim's reject log on mail.python.org to see what's getting rejected today -- Exim kindly logs the full headers of every message that is rejected after the DATA command. I usually make it to about 11am of a given day's logfile before my eyes glaze over from the endless stream of spam and viruses.)
I get about 200 spams per day on my own email accounts, and look at all of them. I don't look at the headers at all, I just look at the msgs in a capable HTML-aware mail reader, as a matter of course while dealing with all the day's email. It's rare that it takes more than a second to recognize a spam by eyeball and hit the delete key. At about 200 per day, it's just now reaching my "hmm, this is becoming a nuisance sometimes" threshold. Our tolerance levels for manual review seem to differ by a factor of 100 or more <wink>.
Note that we *used* to accept messages before passing them to SpamAssassin, so never rejected anything on the basis of its SA score. Back then, we saved and reviewed probably 50-70 messages/day. Very, very, very few (if any) false positives scored >= 10.0, which is why that's the threshold for SMTP-time rejection.
I can tell you the mean false negative and false positive rates on what I've been working on, and even measure their variance across both training and prediction sets. (The fn rate is well under 2% now (adding in more headers should improve that a lot), and the fp rate under 0.05% (but I doubt that adding in more headers will improve this)). So long as we don't know the rates for the scheme you're using now, there's no objective basis for comparison.
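The fn/fp rates mentioned above come from counting misclassifications over labeled test sets. Here is a minimal sketch of that bookkeeping; the function and its input shapes are placeholders, not the actual test harness used for these numbers.

```python
def error_rates(predictions, labels):
    """Return (false_negative_rate, false_positive_rate).

    labels: True for spam, False for ham.
    predictions: True if the classifier called the message spam.
    """
    spams = sum(labels)
    hams = len(labels) - spams
    # A false negative is a spam the classifier let through;
    # a false positive is a ham it wrongly flagged as spam.
    fn = sum(1 for p, is_spam in zip(predictions, labels) if is_spam and not p)
    fp = sum(1 for p, is_spam in zip(predictions, labels) if not is_spam and p)
    return fn / spams, fp / hams
```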
Maybe you don't want this kind of approach at all. The classifier doesn't have "gray areas" in practice: it tends to give probabilities near 1, or near 0, and there's very little in between -- a msg either has a preponderance of spam indicators, or a preponderance of non-spam indicators.
That's a great improvement over SpamAssassin then: with SA, the grey area (IMHO) is scores from 3 to 10... which is why several python.org lists now have a little bit of Mailman configuration magic that makes MM set aside messages with an SA score >= 3 for list admin review. (It's probably worth getting the list admin to do a bit more work in order to avoid sending low-scoring spam to the list.)
However, as long as "very little" != "nothing", we still need to worry a bit about that grey area. What do you think we should do with a message whose spam probability is between (say) 0.1 and 0.9? Send it on, reject it, or set it aside?
Under Graham's scheme, send it on. It doesn't have grey areas in a useful sense, because the scoring step only looks at a handful of extremes: extremes in, extremes out, and when it's wrong it's *spectacularly* wrong (e.g., the very rare (< 0.05%) false positives generally have "probabilities" exceeding 0.99, and a false negative often has a "probability" less than 0.01).
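The "extremes in, extremes out" behavior falls directly out of Graham's combining step, as described in "A Plan for Spam": keep only the token probabilities farthest from 0.5 and fold them together with a naive-Bayes-style formula. A sketch, with MAX_DISCRIMINATORS and the call shape as illustrative assumptions:

```python
from math import prod

MAX_DISCRIMINATORS = 15  # Graham looks at only the most extreme tokens

def combine(token_probs):
    """Combine per-token spam probabilities into one message score."""
    # Sort by distance from the neutral 0.5 and keep the extremes.
    extremes = sorted(token_probs, key=lambda p: abs(p - 0.5),
                      reverse=True)[:MAX_DISCRIMINATORS]
    if not extremes:          # empty body: nothing to judge
        return 0.5
    p = prod(extremes)
    q = prod(1.0 - x for x in extremes)
    return p / (p + q)
```

Because only extremes survive the cut, the products drive the result hard toward 0.0 or 1.0 -- which is also why an empty body (no tokens at all) comes out at exactly 0.5.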
Just how many messages fall in that grey area anyway?
I can't get at my testing setup now and don't know the answer offhand. I'll try to make time tonight to determine the answer. I guess the interesting stats are what percent of hams have probs in (0.1, 0.9), and what percent of spams. In general, it's only very brief messages that don't score near 0.0 or 1.0, so this *may* turn out to be the same thing as asking what percentages of hams and spams are very brief.
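The statistic proposed above -- what percent of hams and spams score inside (0.1, 0.9) -- is easy to compute once the per-message probabilities are in hand. A sketch with invented names; the score lists would come from real classifier output:

```python
def grey_fraction(scores, lo=0.1, hi=0.9):
    """Fraction of scores falling strictly inside the (lo, hi) grey area."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if lo < s < hi) / len(scores)
```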
Note too that adding the headers in *should* catch a lot more spam under this scheme. But, even as is, and even if I strip all the HTML tags out of spam, fewer than 1 spam in 50 scores less than 0.9. The ones that are passed on now include all spams with empty bodies (a message with an empty body scores 0.5).