[Spambayes] Ideas for an MSc project please...

Wed Feb 4 20:50:17 EST 2004

> Does anyone have any ideas for a project that I would like to 
> do at Masters level?  I want to do something on Bayesian filters
> because it is a very interesting idea.  Is there some way of
> improving it?  what could someone research on in this field?

What sort of background do you have?  If you've got a good understanding of
the statistics underneath SpamBayes (I don't! <wink>), then you could try
coming up with variants of the classifier that handle the ham/spam imbalance
problem better.  (In short, the SpamBayes math works best with roughly equal
numbers of ham and spam, and can fall apart with wildly imbalanced numbers.
The only attempt so far to compensate for this mathematically was a failure.)
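
To make the imbalance problem concrete, here is a minimal sketch of a
per-token spam probability.  It is written from memory of how the
classifier roughly works (counts scaled by the number of trained ham and
spam messages, then pulled towards 0.5 when there is little evidence); the
function name and the defaults s=0.45 and x=0.5 are assumptions for
illustration, not the real SpamBayes code.

```python
def token_spam_prob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Rough spam probability for one token (illustrative only).

    ham_count / spam_count: trained messages of each kind containing the token.
    n_ham / n_spam: total trained messages of each kind.
    s, x: assumed strength and prior for tokens with little evidence.
    """
    # Scale the raw counts by corpus size so both classes are comparable.
    ham_ratio = ham_count / n_ham if n_ham else 0.0
    spam_ratio = spam_count / n_spam if n_spam else 0.0
    if ham_ratio + spam_ratio == 0:
        return x  # token never seen: fall back to the prior
    p = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the estimate towards x when the token has been seen only rarely.
    n = ham_count + spam_count
    return (s * x + n * p) / (s + n)


# Same token counts, very different corpus sizes.
balanced = token_spam_prob(1, 100, n_ham=10000, n_spam=10000)
imbalanced = token_spam_prob(1, 100, n_ham=10, n_spam=10000)
print(round(balanced, 3), round(imbalanced, 3))
```

With balanced training the token looks strongly spammy; with only 10 ham
trained, the single ham sighting outweighs 100 spam sightings and the same
token looks hammy.  That sensitivity is the sort of thing a variant of the
classifier would need to handle better.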

There are probably various n-way classification ideas that you could look at,
although POPFile is probably a better place to start than SpamBayes for this
(but see the n-way.py script in the contrib directory), and maybe they have
it all sorted; I don't know.
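
As a starting point for experiments, n-way classification can be sketched
as a plain multinomial naive Bayes classifier with add-one smoothing.  This
is a textbook technique, not the code from n-way.py or POPFile; the class
name and the category labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NWayBayes:
    """Toy multinomial naive Bayes over any number of categories."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.msg_counts = Counter()               # label -> messages trained
        self.vocab = set()

    def train(self, label, tokens):
        tokens = list(tokens)
        self.msg_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        tokens = list(tokens)
        total_msgs = sum(self.msg_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_msgs in self.msg_counts.items():
            counts = self.token_counts[label]
            # Add-one smoothing so unseen tokens don't zero out a category.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_msgs / total_msgs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NWayBayes()
clf.train("work", ["meeting", "report"])
clf.train("personal", ["dinner", "movie"])
clf.train("spam", ["viagra", "free"])
print(clf.classify(["free", "viagra"]))
```

Note that this picks the single best category and has no "unsure" range,
which is one of the things a project could investigate.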

You could look into the effects of message/token expiry - how it affects the
math, how it affects the results, and how SpamBayes could most effectively
do it.
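
One simple form of token expiry could be sketched like this: remember when
each token was last seen in training, and drop tokens that haven't been
seen for a while.  Everything here (the class, its methods, the use of
plain numbers for time) is hypothetical and only meant to show the idea; it
is not how SpamBayes stores its data.

```python
class ExpiringTokenStore:
    """Toy token database that forgets tokens not seen recently."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.records = {}  # token -> [ham_count, spam_count, last_seen]

    def train(self, tokens, is_spam, now):
        # Count each token once per message and note when it was seen.
        for token in set(tokens):
            record = self.records.setdefault(token, [0, 0, now])
            record[1 if is_spam else 0] += 1
            record[2] = now

    def expire(self, now):
        """Remove tokens last seen before now - max_age; return how many."""
        cutoff = now - self.max_age
        stale = [t for t, r in self.records.items() if r[2] < cutoff]
        for token in stale:
            del self.records[token]
        return len(stale)


store = ExpiringTokenStore(max_age=5)
store.train(["cheap", "pills"], is_spam=True, now=0)
store.train(["pills"], is_spam=False, now=10)
print(store.expire(now=12), sorted(store.records))
```

This sketch deliberately ignores the hard part: when tokens are dropped but
the total ham and spam message counts are left alone, the per-token
probabilities shift, and working out the right way to handle that is where
the research would be.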

If you have a background in a non-English (especially Asian) language, then
you could look into adapting SpamBayes to work with it.  (Splitting on
whitespace, for example, which is at the heart of the SpamBayes tokenizer,
is highly unlikely to work in that situation.)  If this were successful, then
it'd probably be worth forking SpamBayes off to create a version for this.
(If you do this, take a look at the open patch that addresses this,
although (IIRC) it doesn't really alter the tokenizing scheme much.)
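
One common approach for text without spaces between words is to use
overlapping character n-grams as tokens instead of whitespace-delimited
words.  A minimal sketch, assuming nothing about how the SpamBayes
tokenizer is structured (the function name is made up):

```python
def char_ngrams(text, n=2):
    """Generate overlapping character n-grams from each run of text.

    Whitespace still separates runs, so mixed-language text is handled,
    but a long run with no spaces is broken into n-character tokens.
    """
    tokens = []
    for chunk in text.split():
        if len(chunk) <= n:
            tokens.append(chunk)  # too short to split further
        else:
            tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


print(char_ngrams("abcd"))
```

Whether bigrams, trigrams, or a proper word segmenter gives the best
results (and what it does to database size) would need testing.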

Another multi-language idea (not necessarily Asian) would be to look at
bilingual (or trilingual, etc.) corpora and see whether there are better ways
of dealing with them.  (For example, although this is naïve and unlikely to
work, you could translate the tokens into a base language.)

You could look into non-email filtering.  Web pages have already been
suggested, but you could also look into chat/SMS/newsgroup/RSS spam
filtering or something like that.  I've heard that SMS and instant-messaging
spam is on the rise, but haven't experienced that myself (but then I
hardly use IM, and NZ is probably somewhat isolated from SMS spam).

You could see if using natural language processing techniques could generate
useful tokens, without sacrificing too much in the way of speed and database
size (there's bound to be some tradeoff).  There's a Python natural language
processing library (<http://nltk.sourceforge.net>), which would be a good
start.

I could go on and on, but won't <wink>.

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.