[spambayes-dev] More stats talk (warning, long)

Fri Sep 12 03:33:59 EDT 2003

Hello all,

I posted a message about a stats feature a while ago and have been working
on some concepts, as well as some code. Using the old version (0.3), I had
something working for me with the source code version. I appropriated the
greyed-out "Advanced" button and made the dialog that I posted to the list at
http://mail.python.org/pipermail/spambayes/attachments/20030813/a1d9b90f/stats-0001.jpg
I was reasonably happy with this and was getting ready to post it, but then
the GUI design changed significantly (definitely for the better!) and so
here I am.

I'm afraid I don't have quite enough Python experience (or CVS experience,
for that matter, right now) to make a working patch. However, I do have some
code that I wanted to send to the list, in the hope that someone with more
knowledge of the GUI might fairly easily be able to slot it into the 1.1
branch of the project. I have looked around in the dialogs.rc file and so
forth, but I am only just "Learning Python" with the book of the same name
and this is a bit beyond me right now.

I'm not sure if a stats tab (there's lots of room next to the other four
tabs...) needs to be justified, but:
* there's at least some demand
* it's worthwhile to provide some sort of feedback to the user on how
effective the program is
* I like numbers ;)

Some of the stats that can fairly easily be generated from this are:
* total number of spam/ham/unsure/total emails found, and the average per day
* session totals of the same since Outlook was loaded
* false positives and false negatives (add to the count when the Recover or
Delete button is pressed, if the message is not already in the appropriate
category)
* all sorts of neat stats such as accuracy, error rate, spam recall, spam
precision, etc., as outlined in
http://www.ics.forth.gr/~potamias/mlnia/paper_2.pdf (not all of these are in
the attached sample code, but they are easy to calculate)
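
To make the last bullet concrete, here is a rough sketch of how those four
figures could be derived from the counters. The function name, the argument
names and the sample numbers are made up for illustration; they are not part
of the attached code, and the formulas are the usual textbook definitions
rather than anything SpamBayes-specific.

```python
def filter_metrics(spam, ham, fp, fn):
    """Return accuracy, error rate, spam recall and spam precision.

    spam and ham are the true counts (after the user's corrections);
    fp is ham filed as spam, fn is spam filed as ham.
    Any ratio with a zero denominator is reported as None.
    """
    total = spam + ham
    caught = spam - fn       # spam correctly filed as spam
    flagged = caught + fp    # everything the filter filed as spam

    def ratio(num, den):
        # float() keeps this a true division on Python 2 as well
        return float(num) / den if den else None

    accuracy = ratio(total - fp - fn, total)
    return {
        "accuracy": accuracy,
        "error_rate": None if accuracy is None else 1 - accuracy,
        "spam_recall": ratio(caught, spam),
        "spam_precision": ratio(caught, flagged),
    }


# Illustrative numbers only: 80 spam, 120 ham, 2 false positives, 8 false negatives
m = filter_metrics(spam=80, ham=120, fp=2, fn=8)
```

With these sample numbers the accuracy comes out at 95% and the spam recall
at 90%.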

There is one potential problem: I think that without integrating the stats
with the database it would be difficult to have a rock-solid accounting
system for every email, for at least two reasons. People can click the
classification buttons multiple times on the same message, and the existence
of the "unsure" category makes things difficult.

Conceptually, every email is either ham or spam. A false positive occurs
when a ham is categorized as spam, and a false negative occurs when a spam
is categorized as ham. I'm not too sure how to fit "unsures" into this
scheme. One possibility is to simply not count them. But, if a particular
mail is rated as "unsure", it's not a FP or FN, and it's not correct either.
Any ideas on how to handle this?

My thought (implemented in the attachment) on this was to just not count
unsure emails in the total until a categorization has occurred (as I always
train "unsure"s immediately, this discrepancy wouldn't show up in the
stats). The total emails reported would then be the actual total received
minus the number of untrained "unsure"s.
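
As a tiny worked example of that arithmetic (the function name and the
numbers are hypothetical, purely for illustration):

```python
def reported_total(received, unsure_pending):
    """Emails shown in the stats: everything received minus unsures not yet trained."""
    if unsure_pending > received:
        raise ValueError("cannot have more pending unsures than emails received")
    return received - unsure_pending


# 100 emails arrived, 7 went to Unsure, and 5 of those have been trained so far,
# so 2 unsures are still waiting and are left out of the total.
total_now = reported_total(received=100, unsure_pending=7 - 5)
```

Once the remaining two unsures are trained, the reported total catches up
with the number actually received.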

I'd personally rather there was no "unsure" category (and have occasionally
set the spam and unsure thresholds to the same number to do this), but that
would be an example of the tail (or at least a few hairs on the tail)
wagging the dog... ;)

As for the clicking-multiple-times problem: with what I have, if someone
does a Delete and then a Recover on the same message, it would count as both
a FP and a FN, which can't really be true. For my purposes this is something
I wouldn't worry about, although I can understand objections here. I don't
know how to get around this without a more tightly integrated stats concept.
Anyway, I think what is suitable is the following:

* When an email comes in: if it is ham or spam, increment the appropriate
counter; if it is unsure, do nothing for now.
* When an "unsure" gets Recovered or Deleted, increment the ham or spam
counter respectively.
* When an email classified as ham gets "deleted as spam", increment the
false negative counter.
* When an email classified as spam gets "recovered from spam", increment the
false positive counter.
* Unsures would be counted as they are right now, i.e. by filter.py.
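
The rules above could be sketched roughly as follows. This is a standalone
illustration, not the Stats class from manager.py: the class name, the
method names and the idea of passing in the filter's original verdict as a
string are all assumptions made for the example.

```python
class StatsCounter(object):
    """Hypothetical stand-in for the counters; not the SpamBayes Stats class."""

    def __init__(self):
        self.num_ham = self.num_spam = self.num_unsure = 0
        self.num_fp = self.num_fn = 0

    def on_filtered(self, verdict):
        # verdict is the filter's decision: "ham", "spam" or "unsure"
        if verdict == "ham":
            self.num_ham += 1
        elif verdict == "spam":
            self.num_spam += 1
        elif verdict == "unsure":
            # tracked separately; not added to ham or spam until trained
            self.num_unsure += 1
        else:
            raise ValueError("unknown verdict: %r" % (verdict,))

    def on_delete_as_spam(self, verdict):
        # verdict is what the filter originally said about this message
        if verdict == "unsure":
            self.num_spam += 1
        elif verdict == "ham":
            self.num_fn += 1

    def on_recover_from_spam(self, verdict):
        if verdict == "unsure":
            self.num_ham += 1
        elif verdict == "spam":
            self.num_fp += 1


c = StatsCounter()
for v in ["ham", "spam", "unsure", "ham"]:
    c.on_filtered(v)
c.on_delete_as_spam("unsure")   # the user trains the unsure message as spam
c.on_delete_as_spam("ham")      # a spam that slipped into the inbox
```

Note that this sketch never takes a count back: if the user reverses a
click, the earlier increment stays, so it does not solve the
multiple-clicks problem.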

My feeling is that if it goes to the Unsure folder then the user is going to
classify the message by hand. I don't really like the idea of having a third
"Unsure" category for stats purposes alongside ham and spam, since it
reflects lack of confidence by the program rather than a reality.

If you read this far, thank you! If this isn't suitable, that's fine, I
learned something anyway and had fun playing around with spambayes and
python.

Regards,

Mark Jeays

Attached are two files:
* sb-stats.txt is an expansion of the Stats class in manager.py
* sb-message.txt is some code to output the various stats (not hooked up to
any GUI), to go in a potential new StatsDialog.py (or similar)

-------------- next part --------------
# expansion of existing Stats class in manager.py

# to use this, these methods would be called in addin.py
# num_unsure, num_seen, num_spam are already incremented in filter.py

# in ButtonDeleteAsSpamEvent:
#     if the message was classified as ham (a false negative):
#         self.manager.stats.num_fn += 1
#     if the message was 'Unsure':
#         self.manager.stats.num_spam += 1

# in ButtonRecoverFromSpamEvent:
#     if the message was classified as spam (a false positive):
#         self.manager.stats.num_fp += 1
#     if the message was 'Unsure':
#         self.manager.stats.num_ham += 1

# in OnDisconnection, call
#     StoreAll()
import time
import _winreg

class Stats:
    def get_time(self):
        # return initial time. if it's not there, initialize with current time
        try:
            temp = _winreg.QueryValueEx(self.key, "time")
            return temp[0]
        except EnvironmentError:
            # no stored value yet: record the current time as the start of counting
            thetime = int(time.time())
            _winreg.SetValueEx(self.key, "time", None, _winreg.REG_DWORD, thetime)
            return thetime

    def get(self, item):
        # wrapper around QueryValueEx; a missing value counts as 0
        try:
            temp = _winreg.QueryValueEx(self.key, item)
            return temp[0]
        except EnvironmentError:
            return 0

    def StoreAll(self):
        # store everything to registry
        #print "init_time: %d, init_seen: %d, init_spam: %d, init_unsure: %d, init_fp: %d, init_fn: %d" % (self.init_time, self.init_seen, self.init_spam, self.init_unsure, self.init_fp, self.init_fn)
        #print "num_seen: %d, num_spam: %d, num_unsure: %d, num_fp: %d, num_fn: %d" % (self.num_seen, self.num_spam, self.num_unsure, self.num_fp, self.num_fn)
        self.key = _winreg.OpenKey(self.root, self.regkey, 0, _winreg.KEY_ALL_ACCESS)
        self.store("num_seen", self.num_seen + self.init_seen)
        self.store("num_spam", self.num_spam + self.init_spam)
        self.store("num_unsure", self.num_unsure + self.init_unsure)
        self.store("num_fp", self.num_fp + self.init_fp)
        self.store("num_fn", self.num_fn + self.init_fn)
        _winreg.CloseKey(self.key)

    def store(self, item, value):
        # wrapper around SetValueEx
        try:
            _winreg.SetValueEx(self.key, item, None, _winreg.REG_DWORD, value)
        except EnvironmentError:
            print "Failed to set item %s with value %d in registry" % (item, value)

    def __init__(self):
        self.root = _winreg.HKEY_CURRENT_USER
        self.regkey = "Software\\Microsoft\\Office\\Outlook\\Addins\\SpamBayes.OutlookAddin"
        self.num_seen = self.num_spam = self.num_unsure = 0

        self.key = _winreg.OpenKey(self.root, self.regkey, 0, _winreg.KEY_ALL_ACCESS)
        self.init_time = self.get_time()
        self.start_time = int(time.time())
        self.init_seen = self.get("num_seen")
        self.init_spam = self.get("num_spam")
        self.init_unsure = self.get("num_unsure")
        self.init_fp = self.get("num_fp")
        self.init_fn = self.get("num_fn")
        self.num_fp = 0
        self.num_fn = 0
        _winreg.CloseKey(self.key)

        #print "init_time: %d, init_seen: %d, init_spam: %d, init_unsure: %d, init_fp: %d, init_fn: %d" % (self.init_time, self.init_seen, self.init_spam, self.init_unsure, self.init_fp, self.init_fn)

-------------- next part --------------
import time  # getStatsMessage uses time.time()

def getStatsMessage(self):
    # return a string with info
    output = ""

    stats = self.mgr.stats

    emails = stats.num_seen + stats.init_seen
    spam = stats.num_spam + stats.init_spam
    unsure = stats.num_unsure + stats.init_unsure
    ham = emails - spam

    currentham = stats.num_seen - stats.num_spam

    fp = stats.num_fp + stats.init_fp
    fn = stats.num_fn + stats.init_fn

    # spam and ham so far reflect the filter's verdicts; apply the user's
    # corrections (fp, fn) to get the true counts
    spam = spam + fn - fp
    ham = ham + fp - fn

    wrong = fp + fn
    starttime = stats.init_time
    currenttime = float(int(time.time()))
    # clamp to one second so the per-day divisions below cannot divide by zero
    days = max(currenttime - starttime, 1.0) / 86400
    hours = (days - int(days)) * 24

    currentdays = max(currenttime - stats.start_time, 1.0) / 86400
    currenthours = (currentdays - int(currentdays)) * 24

    emails = float(emails)
    spam = float(spam)
    ham = float(ham)
    fn = float(fn)
    fp = float(fp)

    output += "This session: Spam: %d, Unsure: %d, Ham: %d, Emails: %d \n" % (stats.num_spam, stats.num_unsure, currentham, stats.num_seen)
    output += "Per Day: Spam: %0.2f, Unsure: %0.2f, Ham: %0.2f, Emails: %0.2f\n" % (stats.num_spam/currentdays, stats.num_unsure/currentdays, currentham/currentdays, stats.num_seen/currentdays)
    output += "Totals: Spam: %d, Unsure: %d, Ham: %d, Emails: %d \n" % (spam, unsure, ham, emails)
    output += "Per Day: Spam: %0.2f, Unsure: %0.2f, Ham: %0.2f, Emails: %0.2f\n" % (spam/days, unsure/days, ham/days, emails/days)
    output += "This session: %d d %d h. Total days counting: %d d %d h\n" % (currentdays, currenthours, days, hours)
    output += "Incorrect evaluations: False Positives: %d, False Negatives: %d\n" % (fp, fn)
    output += "Number of incorrectly evaluated per day %0.2f\n" % ((wrong)/days)

    if emails > 0:
        output += "Percent of email that is spam: %0.2f%%\n" % (100*spam/emails)

    if spam > 0:
        output += "Percent correct on spam: %0.2f%% " % (100*(spam-fn)/(spam))
        if fn > 0:
            output += "(1 in %d spam was misclassified)\n" % (spam/fn)
        else:
            output += "(None misclassified!)\n"

    if ham > 0:
        output += "Percent correct on ham: %0.2f%% " % (100*(ham-fp)/(ham))
        if fp > 0:
            output += "(1 in %d ham was misclassified)\n" % (ham/fp)
        else:
            output += "(None misclassified!)\n"

    if emails > 0:
        output += "Percent correct on all emails: %0.2f%%\n" % (100*(emails-wrong)/(emails))

    output += "(Unsure not counted as spam or ham, or in totals)"

    return output