[spambayes-dev] SpamBayes for Document Categorization?

Michael Murdock michaelmurdock at gmail.com
Fri Jul 22 20:33:26 CEST 2005


I am interested in using SpamBayes as the core classifier for a system I 
want to write that classifies document instances into categories. Instances 
might be formatted in Word, PDF, text, or html. Of course I don't expect 
SpamBayes to know how to read all these different formats. So for the sake 
of discussion, let's just say it could process the text of any document I 
throw at it. 

For the sake of discussion, let's say I have five categories with many 
document instances (training examples) from each of these five categories:

Doc Category #1 - streaming media protocols
Doc Category #2 - media format conversion tools 
Doc Category #3 - DirectShow
Doc Category #4 - media content management systems
Doc Category #5 - none of the above

In my proposed system I drag a document instance into a watch folder, which 
causes a text classifier to open it, analyze it and "tag" it somehow to 
indicate to which of the five categories it belongs (say by moving it into 
one of five directories). 
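To make that concrete, here's a rough sketch of the routing step I have in
mind (pure Python; `route_document` and the directory names are just
placeholders I made up, and the `classify` callable stands in for whatever
SpamBayes-based classifier I end up with):

```python
import os
import shutil

# Hypothetical directory names for the five categories above.
CATEGORIES = [
    "streaming-media-protocols",
    "format-conversion-tools",
    "directshow",
    "content-management",
    "none-of-the-above",
]

def route_document(path, classify, dest_root="."):
    """Read a document's text, classify it, and move the file into
    the directory named after the winning category."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    category = classify(text)  # expected to return one of CATEGORIES
    dest_dir = os.path.join(dest_root, category)
    os.makedirs(dest_dir, exist_ok=True)
    shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
    return category
```

The watch-folder part would just call `route_document` on each new file it
sees; the open question is what `classify` looks like inside.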

Here are my five concerns.

*1. Embedding the SpamBayes Code into My App.*
My first concern is whether the SpamBayes training and classifier code is 
structured so that it can be embedded into this kind of tool. I'm pretty 
comfortable with Python, but rewriting major pieces of SpamBayes for this 
app would be neither fun nor feasible. 
 *2. SpamBayes for Non-Email-Types of Classification.*
 Does it even make sense to start with SpamBayes since my problem domain 
doesn't have anything like email headers or the presence of an attachment, 
etc. that SpamBayes probably uses in its core feature extraction?
 *3. Discriminative Training*

My next concern relates to the lack of discriminative training between 
categories. As I understand it, training SpamBayes on a particular class, 
say class 1, builds a model that makes one discrimination: is this document 
instance a member of class 1 or not? When I train the model for class 1, do 
I include only positive instances (the ham) of Category 1? Or do I also 
include negative instances from the other categories (spam)? 
 If the model for Category 1 is trained only on positive instances from 
that category, then this trained model is independent of the trained models 
for categories 2 through 5, and when it comes time to classify, the model 
that responds "loudest" is the one selected. But, and here's my concern, no 
probability model has ever been built that *discriminates between* the 
categories. Does what I'm describing make sense? I'm thinking of Maximum 
Likelihood training of acoustic models in a speech recognition system, 
which has this same lack of discriminative training, and I'm wondering 
whether multi-class naive-Bayes classifiers share this shortcoming. 
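 For clarity, here is the one-model-per-category, pick-the-loudest scheme 
I'm picturing, with a toy unigram likelihood standing in for the real 
SpamBayes scoring (all of these names are hypothetical, not the SpamBayes 
API):

```python
import math
from collections import Counter

def train(docs):
    """Build a unigram model (word counts plus totals) from a list
    of training documents for one category."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return {"counts": counts,
            "total": sum(counts.values()),
            "vocab": len(counts) + 1}

def log_score(model, text):
    """Log-likelihood of the text under one category's model,
    with add-one smoothing for unseen words."""
    c, t, v = model["counts"], model["total"], model["vocab"]
    return sum(math.log((c[w] + 1) / (t + v)) for w in text.lower().split())

def classify(models, text):
    """Score the text under every category's independent model and
    return the category that responds 'loudest'."""
    return max(models, key=lambda cat: log_score(models[cat], text))
```

Note that each `train` call sees only its own category's documents; nothing 
here ever compares one category's words *against* another's during 
training, which is exactly the lack of discrimination I'm asking about.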
  *4. Adding a New Document Category.*
 Let's say I have trained the models on my five classes (as described above) 
and everything is working fine and I decide to add a new document category. 
Do the first five models need to be trained from scratch (to include the 
negative instances in this new sixth category)? Or can SpamBayes models be 
"incrementally" trained by just training on these new class-6 negative 
 *5. Size of Training Sample*

My final concern relates to the number of training documents I would need. 
I'm guessing that each of my documents, no matter how long or short, reduces 
to a single feature vector for training and classification. Is this correct? 
If so, it would seem that I would need *at least* hundreds of examples from 
each category and probably thousands. Yes? No?
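By "reduces to a single feature vector" I mean something like collapsing 
each document to its set of word features before training, e.g. (a made-up 
tokenizer for illustration, not SpamBayes's own):

```python
import re

def features(text, max_len=12):
    """Reduce a document, however long, to one set of lowercase
    word features; very long words are truncated."""
    return {w[:max_len] for w in re.findall(r"[a-z0-9']+", text.lower())}
```

So a 50-page PDF and a two-line note each contribute exactly one training 
example, which is why I suspect I'd need so many documents per category.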
  Thanks for any thoughts you might have on my concerns and questions.
