[spambayes-dev] Enhanced Outlook statistics display

Meyer, Tony T.A.Meyer at massey.ac.nz
Thu Dec 9 22:49:30 CET 2004


[Tony talking about the weighted cost calculation]
> That's another possibility, although it would
> probably be more difficult to compare against other
> spam filters (especially if anyone did adjust the
> weights).

Yes - if it was to be used to compare then the weights would have to be agreed on in advance.

> John's main point in his "batting average" article was that a
> single accuracy score makes it difficult to see the difference
> between filters that reduce false positives by letting through
> a lot of spam vs. filters that kill almost all of the spam at
> the expense of increased false positives.  By reporting the scores
> separately, the user can make the tradeoff based on what is more
> important to them.

The cost does this too, though, as long as the weights are correct for that user.  E.g. if I *hated* fp's, didn't care at all about unsures, and hardly cared about fn's, I could use weights of (say) 100.0, 0.0 and 0.1 (respectively), and the score would reflect what was important to me.  Kinda like John's method of dividing the two numbers into each other, but better.
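
To make the weighting concrete, here is a rough sketch.  It assumes the cost is just a weighted sum of the raw fp/unsure/fn counts; the function name and the sample counts are made up for illustration, not taken from the SpamBayes code.

```python
def weighted_cost(fp, unsure, fn, weights=(100.0, 0.0, 0.1)):
    """Collapse the three error counts into one number; lower is better.

    ASSUMPTION: a plain weighted sum of counts.  The default weights are
    the example ones from this message (fp, unsure, fn respectively).
    """
    w_fp, w_unsure, w_fn = weights
    return fp * w_fp + unsure * w_unsure + fn * w_fn


# Two made-up filters: A lets a lot of spam through but rarely makes a
# false positive; B catches nearly all spam at the cost of more fp's.
cost_a = weighted_cost(fp=1, unsure=40, fn=200)
cost_b = weighted_cost(fp=3, unsure=5, fn=20)

# With these fp-averse weights, A scores better (lower) than B.
print("A:", cost_a, "B:", cost_b)
```

With different weights (say, all 1.0) the ranking of the same two filters flips, which is why the weights would have to be agreed on in advance for any comparison.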

If the user (or reviewer, or whatever) is able to understand having two (or four!) numbers, then that's better, though.  Comparing filters is hard for many other reasons, anyway (training regime, mail stream, etc.).

[consolidating stats code]
> That would be good, but difficult currently because
> they take entirely different approaches.  The Outlook
> addin totals up the stats as it goes,
> while sb_server recalculates them by iterating through
> the data in the messageinfo database.

I had forgotten about some of this, although I was thinking about a higher-level consolidation that takes the raw counts, as you suggest.

> Maybe the changes you made to utilize the same
> messageinfo database for Outlook will allow us to
> calculate the Outlook stats the same way. 

That's an interesting idea.  It would save us having the separate database.  I've wondered (since I wrote the web interface method) whether it would get really slow as the db increases in size, since it iterates through the whole thing each time the stats are generated.  I should have a play around and see if that is going to be a problem or not (if so, maybe some sort of middle ground between the methods can be found that both systems can use).
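
A toy sketch of the two approaches, to show where the cost difference comes from.  The dict here is only a stand-in for the messageinfo database, and the class and function names are hypothetical, not the real SpamBayes ones.

```python
from collections import Counter

# ASSUMPTION: a stand-in for the messageinfo db, mapping a message id
# to its classification.  The real database stores more than this.
messages = {
    "msg1": "ham",
    "msg2": "spam",
    "msg3": "unsure",
    "msg4": "spam",
}


class RunningStats:
    """Outlook-style: bump a total as each message is classified."""

    def __init__(self):
        self.counts = Counter()

    def record(self, classification):
        self.counts[classification] += 1  # constant work per message


def recalculate(db):
    """Web-interface-style: walk the whole db every time stats are shown."""
    return Counter(db.values())  # work grows with the size of the db


running = RunningStats()
for classification in messages.values():
    running.record(classification)

# Both approaches agree on the totals; they differ in when the work is done.
assert running.counts == recalculate(messages)
```

A middle ground might cache the totals and only walk the records added since the last calculation, but that is just one possibility.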

>> What do you think about the stats that are requested in the tracker?
> Are you referring to RFE #765924 regarding breaking down the stats by
> hour/day/week, etc?  That seems like a lot of work for a questionable
> value, especially since we would probably have to store a bit more
> data in messageinfo to allow it.

Sorry, that was rather vague.  Yes, I did mean that RFE.  Those were my thoughts too.  Maybe a little script that just printed out the current stats would be sufficient - if someone really wanted daily/whatever stats, they could just set up some utility to call that script at the appropriate interval.  The number classified would say how much mail was received in that period, and you could probably extract the rest from it.  Without any more demand, though, I'm inclined to leave it.

[Reset stats button]
> Should be easy enough, I'll take a look. 

Thanks :)

> It would probably be nice to save the date when the
> statistics were last reset, as well.

Good idea.

> I haven't done much with pickles.  Is that something
> that could be easily added to the stats file?

From memory (I don't have access to the code from where I am at the moment), the pickle is just a dict that gets saved.  So you could just add another value ('stats["RESET_DATE"] = date' or something) and it would get saved.
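
Something along these lines, as a sketch only.  It assumes the stats really are a plain dict that gets pickled; the counter key names, the file name and the helper functions are all invented for the example, and only "RESET_DATE" comes from the suggestion above.

```python
import datetime
import os
import pickle
import tempfile


def load_stats(path):
    """Load the stats dict, or start with an empty one if no file exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def save_stats(stats, path):
    """Pickle the whole dict; any extra keys are saved along with it."""
    with open(path, "wb") as f:
        pickle.dump(stats, f)


def reset_stats(path, today=None):
    """Zero the (hypothetical) counters and record when the reset happened."""
    stats = {
        "num_ham": 0,
        "num_spam": 0,
        "num_unsure": 0,
        "RESET_DATE": today or datetime.date.today(),
    }
    save_stats(stats, path)
    return stats


# Round trip through a temporary file to show the date survives.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stats.pik")
    reset_stats(path, today=datetime.date(2004, 12, 9))
    reloaded = load_stats(path)

print("Statistics last reset:", reloaded["RESET_DATE"])
```

Code reading an older pickle would need to cope with the key being absent, e.g. with stats.get("RESET_DATE").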

However, I had forgotten until reading your message about the differences between how the web interface and Outlook go about it.  If it is now possible for the plugin to use the messageinfo db, then maybe we don't need the stats pickle any more.  We could store a classified_date (and trained_date?) in the messageinfo db easily enough, and then only pull the data we want (adding a 'current stats starting point' value too, I guess).  I'll think about this and have a look at how quickly the db is going to increase in size (it's already going to be larger than the old version).
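
Roughly what I have in mind, as a sketch.  The record layout, the field names other than classified_date, and the function name are assumptions for illustration; the real messageinfo db does not necessarily look like this.

```python
import datetime

# ASSUMPTION: each messageinfo record carries its classification and a
# classified_date.  A list of dicts stands in for the real database.
records = [
    {"classification": "spam", "classified_date": datetime.date(2004, 12, 1)},
    {"classification": "ham", "classified_date": datetime.date(2004, 12, 5)},
    {"classification": "spam", "classified_date": datetime.date(2004, 12, 8)},
    {"classification": "unsure", "classified_date": datetime.date(2004, 12, 9)},
]


def stats_since(records, start):
    """Count classifications on or after the 'current stats starting point'."""
    counts = {}
    for record in records:
        if record["classified_date"] >= start:
            kind = record["classification"]
            counts[kind] = counts.get(kind, 0) + 1
    return counts


# "Resetting" the stats then just means moving the starting point forward;
# nothing is deleted from the database.
starting_point = datetime.date(2004, 12, 5)
print(stats_since(records, starting_point))
```

This still walks every record each time, so it does not answer the speed question by itself.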

=Tony.Meyer

