[spambayes-dev] Incremental training results

Thu Jan 8 20:37:42 EST 2004

> Hrm.  You don't have the X axis labeled; what units is it
> using? Days (or rather, groups) as I did?

I think so, almost.  I think that I made a mistake in the graphing and days
(groups) without any mail are skipped.  At a guess, the mail isn't all that
consistent in the 0000 to 0600 groups, and (see below) is almost all ham,
and then from 0600 to 1000 is pretty 'normal'.  If about two-thirds of the
0000 to 0600 groups aren't graphed, then the graph makes sense (in its own
confusing way :) again.  I'll fix this for the next attempt.
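
A minimal sketch of one way to stop empty groups from being skipped, so that
the X axis stays in step with the group numbers.  The helper name and the
dict-of-counts input are assumptions for illustration only; this is not the
actual mkgraph.py code.

```python
def fill_missing_groups(counts, first, last):
    """Return (group, count) pairs for every group from first to last.

    `counts` maps a group number to a value for that group (hypothetical
    input shape).  Groups with no mail get an explicit 0 instead of being
    dropped, so every group occupies one position on the X axis.
    """
    return [(group, counts.get(group, 0)) for group in range(first, last + 1)]


if __name__ == "__main__":
    # Groups 1 and 3 had no mail; they still appear in the output.
    for group, count in fill_missing_groups({0: 3, 2: 1}, 0, 3):
        print(group, count)
```

Whether an empty group should be plotted as zero or carry the previous value
forward depends on what the line represents; zero is only the simplest choice.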

> What happens at
> about 250 to pull it out of what looks like an approximation 
> of an inverse function (with all data lines overlapping) to a 
> very distinct set of separate lines?

I wondered that too, but didn't have a chance to investigate yesterday.  I
see now that it's an artefact of the data I was using - the ham I was using
goes back further than my spam.  Spam filenames start at about 0600 (except
for one oddball at 0354), while ham is from 0000.  So until that day, I was
training and classifying ham only :)  I'll fix this so that the data starts
with a ham/spam mix.  (I've been archiving ham much longer than spam, so
this just means not going back as far, although if I use all the mail that
I've kept from that period it will have much more spam than ham).
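
As a rough sketch of picking the starting point, assuming the group numbers
for ham and spam are available as two collections of integers (the function
name and inputs are hypothetical, not part of the existing harness):

```python
def first_mixed_group(ham_groups, spam_groups):
    """Return the earliest group number that contains both ham and spam.

    Returns None if no group has both kinds of mail.
    """
    common = set(ham_groups) & set(spam_groups)
    return min(common) if common else None


if __name__ == "__main__":
    ham = range(0, 1000)                      # ham from 0000 onwards
    spam = [354] + list(range(600, 1000))     # one oddball, then 0600 onwards
    print(first_mixed_group(ham, spam))
```

Note that an isolated early spam (like the one at 0354) still counts as a
mixed group here, so an outlier like that would need to be dropped by hand
before choosing the start.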

> Can you post the changes to mkgraph.py?

Will do.  (BTW I tried to find a copy of plotmtv that I could run on Windows
or under cygwin, or something else that would run there and could read mtv
files, but was unsuccessful - does anyone know of anything?).

> >I also had a stab at creating a regime, which might possibly be all
> >wrong :)
> 
> Your regime looks fine to me.

At least I understood something!  I guess that means that, for me at least,
that's not the training regime to use (assuming that things don't change in
my next, more informed, attempt).

> OK.  The incremental harness is built to do all 10
> classifiers at once (for the input sans each set) by default. 
> There's a command line option to do just one classifier 
> (excluding a specified set), which I always use (my machine 
> doesn't have the memory to hold all 10 classifiers at once).  
> I'm guessing that you used the former (default) behaviour... 

Yes, I did.  I'll change this, too, since it makes the running easier.  This
is the -s option, I presume.

> The 10-day span thing means
[...]
> The span plots give some idea of 'what is the performance at
> this time, as the user would experience it', whereas the 
> cumulative plots show, well, the overall numbers as they mature.

Thanks for that.  I had something like this in my head, I think, but I
understand it now.
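
To make the distinction concrete, here is a small sketch of the two kinds of
numbers.  It assumes per-group error and message counts as plain lists, and
it assumes a span is a trailing window of the last N groups; the exact span
definition is not reproduced above, so treat the window as my guess rather
than what the harness does.

```python
def cumulative_rates(errors, totals):
    """Error rate over all groups seen so far, one value per group."""
    rates = []
    err_sum = tot_sum = 0
    for err, tot in zip(errors, totals):
        err_sum += err
        tot_sum += tot
        rates.append(err_sum / tot_sum if tot_sum else 0.0)
    return rates


def span_rates(errors, totals, span=10):
    """Error rate over a trailing window of `span` groups (assumed meaning)."""
    rates = []
    for i in range(len(errors)):
        start = max(0, i - span + 1)
        err = sum(errors[start:i + 1])
        tot = sum(totals[start:i + 1])
        rates.append(err / tot if tot else 0.0)
    return rates


if __name__ == "__main__":
    errors = [1, 0, 0, 2]
    totals = [10, 10, 10, 10]
    print(cumulative_rates(errors, totals))
    print(span_rates(errors, totals, span=2))
```

The span numbers react quickly to a bad recent patch, while the cumulative
numbers smooth it out as more mail accumulates.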

I'll post new versions when I get them done.

=Tony Meyer