[Spambayes] training WAS: aging information

Tue Feb 18 18:56:38 EST 2003

On 18 Feb 2003 at 12:48, Tim Stone - Four Stones Expressions wrote:

> >As far as I remember, it keeps the last 7 days....
> 
> This is true.  If you pay no attention, stuff goes away after 7 days.
> 

That's definitely worth knowing. Thanks. 

> >
> >> In any case, I'm trying to figure out whether it's possible to save
> >> myself the increasingly-annoying chore of going to the web interface
> 
> An idea that we toyed with, and even made a prototype implementation,
> was to include an smtpproxy in the mix.  With that, you could train by
> forwarding a mail to spam@ or ham@.  This was very convenient, and
> eliminated much of the 'increasingly-annoying chore' you refer to (which
> incidentally is part and parcel of Bayesian (machine learning)
> algorithms).  The problem with using an smtpproxy is that most mailers
> mess around with the headers. Some of them even lop almost all of them
> off.  There are many important clues in the headers, and these clues are
> simply missed by this mechanism.  So we chose to cache incoming mail and
> give a user interface, so training could be done on the intact mail.
> 
> But you bring up an interesting point, in that it's very possible that
> having to train will be viewed as an annoying chore by many people.  The

What was really concerning me was that I had seen no indication that it 
was permissible simply to stop training -- and that if I did so, the 
system wouldn't just store incoming e-mails forever.
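
To make the 7-day behaviour concrete, here is a minimal sketch of age-based
expiry. Only the 7-day figure comes from this thread; the names (RETENTION,
expire_cache, the dict used as a cache) are invented for illustration and
are not taken from the spambayes code.

```python
from datetime import datetime, timedelta

# Assumption: 7 days is the retention period mentioned in the thread.
RETENTION = timedelta(days=7)


def expire_cache(cache, now):
    """Return only the cached messages that are younger than RETENTION.

    `cache` maps a hypothetical message id to the time the message arrived.
    """
    return {
        msg_id: arrived
        for msg_id, arrived in cache.items()
        if now - arrived < RETENTION
    }


now = datetime(2003, 2, 18, 18, 56)
cache = {
    "msg-old": now - timedelta(days=8),  # past the retention period
    "msg-new": now - timedelta(days=1),  # still within it
}
kept = expire_cache(cache, now)
print(sorted(kept))
```

If something along these lines runs regularly, a user who stops training
never accumulates more than a week of cached mail.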

So the first stop-gap solution is simple: somewhere state clearly that 
once the filter is working to a user's satisfaction, the user can stop 
training. 

Then the problem will be what to do when a spam gets through (or a ham 
doesn't). Obviously (if anything is truly obvious) the user will want 
to train on that one particular mail. The current interface would make 
this a nightmare -- there am I, sitting with 7 days' worth of e-mail 
(which in my case would be something like 1500 messages), and I want to 
find the one that has been misclassified.

So it seems to me that there has to be something like the smtpproxy 
thing. But then I'm biased: my MUA doesn't delete headers. (Actually, I 
was unaware that any mailers did that sort of thing; but I readily 
admit that I'm a naïve rustic.)
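
As a rough illustration of the train-by-forwarding idea, the sketch below
picks a training label from the address a message was forwarded to. It is
not the prototype Tim mentions; the function name training_label and the
example.org addresses are hypothetical, and it uses only the Python
standard library.

```python
from email import message_from_string
from email.utils import getaddresses


def training_label(forwarded_text):
    """Return 'spam', 'ham', or None based on the local part of the To: address."""
    msg = message_from_string(forwarded_text)
    for _name, addr in getaddresses(msg.get_all("To", [])):
        local_part = addr.partition("@")[0].lower()
        if local_part in ("spam", "ham"):
            return local_part
    return None


# Hypothetical forwarded message; the addresses are placeholders.
example = (
    "From: doc@example.org\n"
    "To: spam@example.org\n"
    "Subject: Fwd: cheap pills\n"
    "\n"
    "forwarded body\n"
)
print(training_label(example))
```

Note that the headers parsed here belong to the forwarding message, not to
the original one, which is exactly the weakness Tim describes: whatever the
MUA drops or rewrites on forwarding never reaches the trainer.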

> smtpproxy might provide a much more convenient training mechanism. 
> We've also toyed with the idea of providing pretrained databases, so
> people don't have to start training from scratch.  Of course, the

I don't really like that idea very much. I'm trying to come up with a 
logical explanation for that feeling, though, and not doing very well. 
This is the best I can do:

I am impressed at how quickly spambayes has moved toward near 100% 
accuracy on my system. (So far today it has classified a single spam as 
unsure; everything else has been classified correctly.) If I had 
started from a pre-seeded database, it isn't at all clear that it would 
have converged to my idea of spam as quickly as it did starting from an 
empty database. Obviously, the experiment could be done to see if it really 
is worth it, but I suspect that all of us have better things to do than 
to grab a ton of spam and build some filters. Maybe I'm wrong. I 
frequently am :-)

> stop even doing this when I'm satisfied with my fp/fn rate, and will
> then only train on mistakes, and occasionally on correctly classified
> stuff to be sure things don't get out of whack.  - TimS
> 

I saw a comment in the LJ article that one should train on roughly 
equal numbers of spam and ham. Is this actually true? (This question of 
course merely demonstrates that I'm too lazy to do the maths myself.)
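
For anyone equally lazy, here is a small sketch of why the balance might or
might not matter. It assumes a Graham-style per-token estimate that divides
each count by the number of messages trained in that class; I have not
checked that this matches what the spambayes classifier actually does, and
the function names and numbers are made up.

```python
def token_spam_prob(spam_hits, ham_hits, nspam, nham):
    """Per-token spam estimate from per-class frequencies, not raw counts."""
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    return spam_ratio / (spam_ratio + ham_ratio)


def raw_count_prob(spam_hits, ham_hits):
    """Naive estimate that ignores how many messages of each kind were trained."""
    return spam_hits / (spam_hits + ham_hits)


# A token that appears in 20% of spam and 20% of ham carries no information.
balanced = token_spam_prob(20, 20, nspam=100, nham=100)
# Same token, but trained on a 1:9 spam-to-ham split.
skewed = token_spam_prob(20, 180, nspam=100, nham=900)
# The raw-count version is pulled toward ham by the imbalance alone.
naive = raw_count_prob(20, 180)

print(balanced, skewed, naive)
```

Under this assumption the normalised estimate is the same for both splits,
while the raw-count estimate is not. That only shows that normalising
removes one obvious bias; it does not settle whether equal numbers matter
in practice.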

One thing I've learned by doing the training is that approximately 10% 
of my mail is spam. I'm surprised, because I would have guessed that 
the proportion was lower than that. I guess that I had got to the point 
where I mentally just filtered it out of consciousness as I clicked the 
"delete" button every morning on the night's accumulation of the stuff.

I really am going to have to try to find time to do the aging thing, 
though. I want to experiment with classifying off-thread postings to 
reflectors as spam :-) I suspect that it won't work very well, but the 
experiment seems like it's worth a try.

  Doc