[Python-Dev] Re: The first trustworthy <wink> GBayes results

François Pinard pinard@iro.umontreal.ca
Mon, 02 Sep 2002 08:02:55 -0400


[Tim Peters]

[... extremely good work and stuff and comments, for a good while now ...]

Hi, Tim.  I read your messages, witnessing your work and progress in that
area, with great interest, and also saved them for later contemplation! :-)

Spam has always annoyed me, as it does most of us, and despite my many
efforts it is increasingly successful at getting through my filters -- so
this idea of Graham or Bayesian filters is timely and welcome.  Most previous
filters I have looked at are based on various (random) tests or events (you
surely know all this), and `procmail'-based filters, or even the popular
SpamAssassin, are either very slow or at least slow.  The tool I have used
since 1998 is much faster, especially since I rewrote it in Python!  It is
also based on various tests or events.

Your work has concentrated on tuning the statistical formulas and lexical
analysis, and on building operational data from preset corpora.  I'm sure all
the knowledge gleaned there will make its way everywhere, and reach me.  For
my tiny share, I decided to experiment with the day-to-day user aspects of
using such a filter, and built a Gnus interface over Eric Raymond's
Bogofilter.  This program has two functions: one learns from messages known
to be ham or spam, the other classifies incoming messages.  By the way, if
there are Gnus users among you, just ask me for the recipe...

It goes pretty well for me, so far.  The principle, put forward by Paul
Graham, is to give the user two delete commands: delete-as-ham and
delete-as-spam.  Eric pushed this idea a bit further by postponing learning
until the user quits the mail reader, `mutt' in his case.  As Gnus allows me
to have many mailgroups and folders and to shuffle between them, I postpone
learning until the user switches mailgroups or quits, and only for the _final_
disposition of a message: that is, when a message is merely saved into another
folder, the decision is taken when leaving that other folder, not the
current one.  Messages marked as "saved" are _not_ sent to the learner, so as
to avoid double learning.
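
A minimal Python sketch of this postponed learning, for illustration only: it
is not the actual Gnus code, and the class name, the mark constants and the
`learn' callback are all hypothetical stand-ins for whatever really feeds the
filter.

```python
from collections import defaultdict

# Hypothetical disposition marks; only HAM and SPAM are ever learned.
HAM, SPAM, SAVED = "ham", "spam", "saved"


class DeferredLearner:
    """Queue dispositions per group and learn only when a group is left."""

    def __init__(self, learn):
        # `learn(message, label)` is an assumed callback standing in for
        # whatever actually trains the filter.
        self._learn = learn
        # group name -> {message id: (message, mark)}
        self._pending = defaultdict(dict)

    def mark(self, group, msg_id, message, mark):
        # A later mark replaces an earlier one, so only the final
        # disposition of a message is remembered.
        self._pending[group][msg_id] = (message, mark)

    def leave_group(self, group):
        """Flush one group and return how many messages were learned."""
        learned = 0
        for message, mark in self._pending.pop(group, {}).values():
            if mark == SAVED:
                # Learned later, from the folder the message was saved to.
                continue
            self._learn(message, mark)
            learned += 1
        return learned

    def quit(self):
        """Flush every group that still has pending dispositions."""
        return sum(self.leave_group(group) for group in list(self._pending))
```

Nothing reaches the callback while the user is still reading a group, so a
message can change its mark any number of times before the group is left.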

The fact is that ham messages are more likely to be postponed than spam,
because ham is more often filed here and there.  Even if many or most ham
messages are deleted, this introduces a short-term bias in the learning
statistics which makes the percentage of spam look higher than it is (in my
case, 1157 messages have been learned in about three days, 20% of which were
spam); this percentage will drop later as filed messages get reprocessed.
Another effect is that the delay in ham learning may itself have a slight
effect on classification, but since both ham and spam are well represented,
the effect is likely negligible.
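
The size of that bias can be worked out with a few lines of Python.  The 1157
and 20% figures are the ones quoted above; the count of filed ham still
waiting to be learned is purely hypothetical, chosen only to show the
direction of the effect.

```python
def spam_share(spam, ham):
    """Fraction of learned messages that are spam."""
    return spam / (spam + ham)


learned = 1157                    # messages learned so far (from the text)
spam = round(0.20 * learned)      # roughly 20% of them were spam
ham = learned - spam

# Hypothetical: ham filed into folders and not yet learned.
pending_ham = 300

observed = spam_share(spam, ham)
eventual = spam_share(spam, ham + pending_ham)

print(f"observed spam share: {observed:.1%}")
print(f"after filed ham is learned: {eventual:.1%}")
```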

Tim's corpora are surely very clean, at least by now, while day-to-day
learning may yield slightly tainted data.  In my case, when a thread does not
interest me, I often kill all the articles it contains in one command, without
opening each of them to check whether it might be spam: the threading itself
makes that unlikely.  Unlikely, but still possible: you have surely noticed
that the bad guys now fetch and re-use already published subjects as a way to
get through.  This means that while big corpora are conceivable for mailing
lists which have existed for a while, they are probably not very practical
for individual users.  GBayes, Bogofilter and others should ideally resist
some amount of ham-tainted-as-spam or spam-tainted-as-ham at learning time.
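
One way to measure that resistance would be to mislabel a known fraction of a
clean corpus on purpose and compare the results before and after.  The helper
below is only a sketch of the mislabelling step; its name, the "ham"/"spam"
label strings and the seeding scheme are assumptions, not part of GBayes or
Bogofilter.

```python
import random


def taint_labels(labels, rate, seed=0):
    """Return a copy of `labels` with a fraction `rate` of them flipped.

    Labels are assumed to be the strings "ham" and "spam".  The same seed
    always flips the same positions, so runs can be compared.
    """
    flip = {"ham": "spam", "spam": "ham"}
    rng = random.Random(seed)
    count = round(rate * len(labels))
    tainted = list(labels)
    # Pick `count` distinct positions and swap the label at each one.
    for index in rng.sample(range(len(labels)), count):
        tainted[index] = flip[tainted[index]]
    return tainted
```

Training once on the clean labels and once on `taint_labels(labels, 0.05)`
would then show how much a 5% error rate at learning time costs.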

After adding Graham filtering as a supplementary method to my spam detection
tool, I am glad to observe that it catches many spam messages which would
otherwise fall through the cracks, so it really brings something to me.  But
I also see many spam cases (are they?) that it does not detect, and hardly
could: one simple example is that _for me_, invalidly structured MIME is
indicative of an uninteresting message, as interesting people know better!
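
A structural test of that kind is easy to sketch with the parser in today's
standard `email' package, which records the problems it meets in a `defects'
list on each message part.  This is an illustration under that assumption,
not the test my tool performs; the function names and the two sample messages
are made up.

```python
import email


def mime_defects(raw_text):
    """Collect the defects the parser recorded on every part of a message."""
    msg = email.message_from_string(raw_text)
    found = []
    for part in msg.walk():
        found.extend(part.defects)
    return found


def looks_malformed(raw_text):
    """True when the parser reported at least one structural defect."""
    return bool(mime_defects(raw_text))


# Sample: a well-formed multipart message with one text part.
GOOD = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XYZ--\n"
)

# Sample: claims to be multipart, but the boundary never appears.
BAD = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "hello, no boundary line ever appears\n"
)

print(looks_malformed(GOOD), looks_malformed(BAD))
```

Such a flag is a separate, non-statistical test; a word-based filter has no
way to see it unless the tokenizer turns it into a token.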

One particular problem I observed is Tim's messages themselves, which are
undoubtedly very yummy ham messages, but which discuss and quote a lot of
spam.  Should these be registered as ham or spam? :-) Would these not defeat
the learning to some extent?  Where should Tim put his own messages in the
corpora he uses, and what changes would result in `GBayes' effectiveness?

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard