Wed Aug 20 15:36:08 EDT 2003

Feature Requests item #783846, was opened at 2003-08-06 00:31
Comment added
Nicolas Kruchten (simnick66)
Summary: Generalize!

Initial Comment:
I think that the spambayes engine (which works great, 
btw) could be generalized to classify way more than just 
spam; it could completely replace rule-based filtering.

As I understand it (correct me if I'm wrong!), there is 
nothing spam-specific about this software; it could be 
used to classify any message into any folder while 
retaining its spam-filtering abilities.

The way this would work would be to specify a default 
folder (the inbox) and then a set of other folders (e.g. 
spam, folder1, folder2, folder3). When a message 
comes in, it could be shuttled to the appropriate folder if 
the threshold for that folder's Bayesian rating is crossed. 
If multiple thresholds are crossed, it could then go into 
an unsure folder, or the values could be compared to 
see where it should go. Each folder could have an 
independently set threshold...
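
A minimal sketch of the routing rule described above, assuming each 
folder already has a score between 0 and 1 from some classifier. The 
names route, scores, thresholds and compare are hypothetical and are 
not part of spambayes:

```python
def route(scores, thresholds, default="inbox", unsure="unsure", compare=False):
    """Pick a folder for one message.

    scores     -- dict mapping folder name to that folder's rating (0..1)
    thresholds -- dict mapping folder name to its own cutoff (0..1)
    compare    -- if True, break ties by the largest margin over the cutoff
                  instead of sending the message to the unsure folder
    """
    # Folders whose independently set threshold was crossed.
    crossed = [f for f, s in scores.items() if s >= thresholds[f]]
    if not crossed:
        return default          # nothing matched: leave it in the inbox
    if len(crossed) == 1:
        return crossed[0]       # exactly one match: move it there
    if compare:
        # Several matches: compare how far each score exceeds its cutoff.
        return max(crossed, key=lambda f: scores[f] - thresholds[f])
    return unsure               # several matches: let the user decide
```

For example, with thresholds of 0.9 for spam and 0.8 for folder1, a 
message scoring 0.95 and 0.85 lands in the unsure folder unless compare 
is set.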

Is there an easy way to hack the current Outlook plugin 
to do this for testing purposes (e.g. to install multiple 
versions and spoof them into comparing different folders, 
or...)?


>Comment By: Nicolas Kruchten (simnick66)
Date: 2003-08-20 21:36

OK, I'll try it, thanks!


Comment By: Tony Meyer (anadelonbrin)
Date: 2003-08-19 08:40

You can now try Skip's n-way.py script in the contrib 
directory in CVS, which runs a cascade of n spambayes 
classifiers as Tim describes.


Comment By: Tim Peters (tim_one)
Date: 2003-08-06 00:51

The spambayes engine doesn't really know anything 
about "ham" and "spam"; it starts for each user as a blank 
slate and knows only what they teach it.  So in that sense it 
could be more general, although the tokenization strategies 
were specifically designed for, and tuned on, the ham-vs-
spam task.

A higher hurdle is that spambayes is "deeply" a binary (two-
way) classifier.  Classic Bayesian classifiers are N-way, but 
spambayes threw away that part of the math, replacing it 
with a two-way classifier based on very different theory.

I know at least one person tried to do N-way classification by 
training N distinct spambayes classifiers, and running them in 
sequence.  That was quite a while ago, and I don't know how 
happy they were with the results.
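
A rough sketch of that "N classifiers in sequence" idea, for 
illustration only.  ToyBinaryClassifier is a hypothetical stand-in for 
one two-way classifier and does NOT use the spambayes math; 
train_cascade, classify and the 0.7 cutoff are likewise assumptions:

```python
from collections import Counter


class ToyBinaryClassifier:
    """Stand-in two-way classifier ("this folder" vs "everything else")."""

    def __init__(self):
        self.pos = Counter()  # tokens seen in messages for this folder
        self.neg = Counter()  # tokens seen in all other messages

    def learn(self, tokens, is_member):
        (self.pos if is_member else self.neg).update(set(tokens))

    def score(self, tokens):
        # Average of smoothed per-token ratios; unknown tokens are ignored.
        probs = []
        for t in set(tokens):
            p, n = self.pos[t], self.neg[t]
            if p + n:
                probs.append((p + 0.5) / (p + n + 1.0))
        return sum(probs) / len(probs) if probs else 0.5


def train_cascade(labelled, folders):
    """Train one binary classifier per folder, in the given order."""
    cascade = []
    for folder in folders:
        clf = ToyBinaryClassifier()
        for tokens, label in labelled:
            clf.learn(tokens, label == folder)
        cascade.append((folder, clf))
    return cascade


def classify(cascade, tokens, cutoff=0.7, default="inbox"):
    """Run the classifiers in sequence; the first confident one wins."""
    for folder, clf in cascade:
        if clf.score(tokens) >= cutoff:
            return folder
    return default
```

Note that the order of the folders matters in a cascade like this: an 
earlier classifier that fires hides every later one.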

In any case, trying to hack the Outlook plugin will have you 
crawling on your belly for a month <wink -- but dealing with 
Outlook is massively complicated>.


