[spambayes-dev] More stats talk (warning, long)
Mark Hammond
mhammond at skippinet.com.au
Fri Sep 12 22:41:22 EDT 2003
> I'm afraid I don't have quite enough python experience (or
> CVS experience, for that matter, right now) to make a working
> patch. However, I do have some code that I wanted to send to
> the list,
This is a good start - thanks! If you can, there are still some things you
can do to help us slot this in.
Regarding CVS experience - my suggestion is that you simply start with a
decent "diff" program, and ignore CVS. Almost any "diff" should do :) Take
a copy of the files you are editing (say, addin_orig.py), and once you
think you have a nice set of changes, run:
C:\> diff -u addin_orig.py addin.py > addin.patch
Then you can review addin.patch, and check that all your changes still make
sense now it is all finished :) I'd be happy to accept addin.patch and
integrate it in. Slowly we will get to adding a "cvs" to the start of that
command :)
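If no "diff" program is handy, Python's standard difflib module can produce
the same unified format. This is only a sketch: the make_patch helper name is
made up for illustration, and it takes file contents as strings rather than
reading the files itself.

```python
import difflib


def make_patch(old_text, new_text, old_name, new_name):
    """Return a unified diff (like "diff -u") between two strings.

    make_patch is a hypothetical helper, not part of SpamBayes.
    """
    diff_lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(diff_lines)


if __name__ == "__main__":
    # Illustrative contents only; in practice read addin_orig.py and addin.py.
    print(make_patch("a = 1\nb = 2\n", "a = 1\nb = 3\n",
                     "addin_orig.py", "addin.py"))
```

Writing the returned string to addin.patch gives a file that can be reviewed
and mailed just like the output of the command above.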
I'd suggest you do the following:
* Create a new stats.py file - this would be similar to the attached
sb-stats.txt
* Add the following methods to the "Stats" class:
- RecordFilterAction(self, disposition) # i.e., "Yes", "No", "Unsure" or
"Error"
- RecordRecoverAction(self, recover_type)
* Have filter.py and addin.py call these methods. All counters etc are
managed internally by the class, rather than externally set as filter.py
does now.
* Add a method "GenerateReport()" to the "Stats" class - this would return a
string - similar to sb-message.txt
* Mail me "stats.py", an "addin.patch", "filter.patch" and any other .patch
file that becomes necessary.
This way we will have an excellent start, and the rest will eventually fall
into place. "The rest" will then consist mainly of changes to "stats.py".
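A minimal sketch of what such a stats.py could look like. Only the three
method names and the four disposition strings come from the list above. The
attribute names, the report layout and the error handling are assumptions
made for illustration, not the contents of sb-stats.txt or sb-message.txt.

```python
class Stats:
    """Hypothetical sketch of the proposed Stats class."""

    DISPOSITIONS = ("Yes", "No", "Unsure", "Error")

    def __init__(self):
        # All counters live inside the class; callers never set them directly.
        self.filtered = dict.fromkeys(self.DISPOSITIONS, 0)
        self.recovered = {}  # recover_type -> count (types are caller-defined)

    def RecordFilterAction(self, disposition):
        if disposition not in self.filtered:
            raise ValueError("unknown disposition: %r" % (disposition,))
        self.filtered[disposition] += 1

    def RecordRecoverAction(self, recover_type):
        self.recovered[recover_type] = self.recovered.get(recover_type, 0) + 1

    def GenerateReport(self):
        # Assumed layout: one line per counter, returned as a single string.
        lines = ["Messages filtered: %d" % sum(self.filtered.values())]
        for name in self.DISPOSITIONS:
            lines.append("  %s: %d" % (name, self.filtered[name]))
        for name in sorted(self.recovered):
            lines.append("Recovered (%s): %d" % (name, self.recovered[name]))
        return "\n".join(lines)
```

With something like this in place, filter.py and addin.py would only call
the Record* methods, and all formatting stays inside GenerateReport().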
> There is one potential problem, which is that I think without
> integrating
> the stats with the database it would be difficult to have a rock-solid
> accounting system for every email, for at least two reasons:
> people can
> classify on buttons multiple times,
We manage this already, so that is no problem.
> and the existence of the "unsure"
> category makes things difficult.
>
> Conceptually, every email is either ham or spam. A false
> positive occurs
> when a ham is categorized as spam, and a false negative
> occurs when a spam
> is categorized as ham. I'm not too sure how to fit "unsures" into this
> scheme. One possibility is to simply not count them. But, if
> a particular
> mail is rated as "unsure", it's not a FP or FN, and it's not
> correct either.
> Any ideas on how to handle this?
>
> My thought (implemented in the attachment) on this was to
> just not count
> unsure emails in the total until a categorization has
> occurred (as I always
> train "unsure"s immediately, this discrepancy wouldn't show up in the
> stats). The total emails reported would then be the actual
> total received
> minus the number of untrained "unsure"s.
Yes, I think you are correct. We simply ignore unsure for certain stats.
However, the key thing to do is get the start of a little "interface" all
set up, and get even basic stats going. My pathetic excuse for stats
prompted you to write this. One small step at a time, as long as it is in the
right direction, will get us there.
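To make the "ignore unsure" idea concrete, here is a small sketch of the
accounting described in the quoted text: untrained unsures are left out of the
counted total. The summarize function, its parameter names and the choice to
express rates as a fraction of counted messages are all assumptions for
illustration.

```python
def summarize(ham_ok, spam_ok, false_pos, false_neg, unsure_untrained):
    """Hypothetical helper: totals that skip untrained "unsure" messages."""
    counted = ham_ok + spam_ok + false_pos + false_neg
    received = counted + unsure_untrained
    return {
        "total_received": received,
        # Total reported = actual total minus untrained unsures.
        "total_counted": counted,
        # One possible definition of the rates; others are equally reasonable.
        "fp_rate": false_pos / counted if counted else 0.0,
        "fn_rate": false_neg / counted if counted else 0.0,
    }


if __name__ == "__main__":
    # Made-up numbers, purely to show the arithmetic.
    print(summarize(ham_ok=90, spam_ok=5, false_pos=1, false_neg=4,
                    unsure_untrained=10))
```

Once an unsure message is trained, it would move into one of the four counted
buckets and start showing up in the total.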
> As for the clicking-multiple times problem, using what I
> have, if someone
> does a Delete and Recover on the same message it would count
> as both a FP
> and FN, which can't really be true. For my purposes this seems like
> something that I wouldn't worry about, although I can
> understand objections
> here. I don't know how to get around this without a more
> tightly-integrated
> stats concept.
We can integrate our stats concept as tightly as we want - it just means a
few more ".patch" files to attach :) We would be able to keep these
counters in synch without too much trouble, and we don't have to get it
right the first time.
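One way the counters could be kept in sync is to remember, per message, which
correction is currently counted. The sketch below is an assumption for
illustration only (the CorrectionTracker class, the message-id keys and the
two counter names are invented here) and does not describe how SpamBayes
handles this today.

```python
class CorrectionTracker:
    """Hypothetical per-message bookkeeping for FP/FN corrections."""

    def __init__(self):
        self.counts = {"false_positive": 0, "false_negative": 0}
        self._last = {}  # message id -> correction currently counted

    def record(self, msg_id, correction):
        if correction not in self.counts:
            raise ValueError("unknown correction: %r" % (correction,))
        previous = self._last.get(msg_id)
        if previous == correction:
            return  # same button clicked again; already counted once
        if previous is not None:
            # The opposite correction undoes the earlier one, so the message
            # ends up counted as neither a FP nor a FN.
            self.counts[previous] -= 1
            del self._last[msg_id]
            return
        self.counts[correction] += 1
        self._last[msg_id] = correction
```

With this, a Delete followed by a Recover on the same message leaves both
counters where they started, instead of counting one FP and one FN.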
> If you read this far, thank you! If this isn't suitable,
> that's fine, I
> learned something anyway and had fun playing around with spambayes and
> python.
I think it is all suitable, and I hope you keep playing and learning. If
you can keep pushing this along as I mention above, then it will all happen
fairly quickly. Before you know it you will be firing off patches too fast
for me to keep up with :)
Mark.