[spambayes-dev] SpamBayes for Document Categorization?
michaelmurdock at gmail.com
Fri Jul 22 20:33:26 CEST 2005
I am interested in using SpamBayes as the core classifier for a system I
want to write that classifies document instances into categories. Instances
might be formatted in Word, PDF, text, or html. Of course I don't expect
SpamBayes to know how to read all these different formats. So for the sake
of discussion, let's just say it could process the text of any document I
throw at it.
For the sake of discussion, let's say I have five categories with many
document instances (training examples) from each of these five categories:
Doc Category #1 - streaming media protocols
Doc Category #2 - media format conversion tools
Doc Category #3 - DirectShow
Doc Category #4 - media content management systems
Doc Category #5 - none of the above
In my proposed system I drag a document instance into a watch folder, which
causes a text classifier to open it, analyze it and "tag" it somehow to
indicate to which of the five categories it belongs (say by moving it into
one of five directories).
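To make the workflow concrete, here is roughly the driver loop I imagine (the classify() function below is a placeholder for whatever SpamBayes-based model I end up with, and all the names are mine):

```python
import shutil
from pathlib import Path

CATEGORIES = [
    "streaming-media-protocols",
    "format-conversion-tools",
    "directshow",
    "content-management",
    "none-of-the-above",
]

def classify(text):
    """Placeholder classifier: returns one of CATEGORIES.
    Stands in for the real SpamBayes-based model."""
    return CATEGORIES[-1]

def process_watch_folder(watch_dir, out_dir):
    """Move each document in watch_dir into a per-category subfolder."""
    watch, out = Path(watch_dir), Path(out_dir)
    for doc in watch.iterdir():
        if not doc.is_file():
            continue
        # Assume the document's text has already been extracted
        # from Word/PDF/HTML by some other component.
        text = doc.read_text(errors="ignore")
        category = classify(text)
        dest = out / category
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(doc), str(dest / doc.name))
```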
Here are my five concerns.
*1. Embedding the SpamBayes Code into My App.*
My first concern is whether the SpamBayes training and classifier code is
structured so that it can be embedded in this kind of tool. I'm pretty
comfortable with Python, but rewriting major pieces of SpamBayes for this
app would be neither fun nor feasible.
*2. SpamBayes for Non-Email-Types of Classification.*
Does it even make sense to start with SpamBayes, given that my problem
domain has nothing like email headers, the presence of an attachment, and
so on, which SpamBayes presumably uses in its core feature extraction?
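For comparison, here is the kind of plain-text feature extraction I imagine substituting for the email-specific parts (just a naive sketch of what a document-oriented tokenizer might emit; I know the real SpamBayes tokenizer does far more than this):

```python
import re

def tokenize(text):
    """Naive feature extraction for plain document text: lowercase word
    tokens, skipping very short and very long ones. A rough stand-in
    for an email tokenizer's header/attachment features."""
    for word in re.findall(r"[A-Za-z0-9'$-]+", text):
        if 3 <= len(word) <= 12:
            yield word.lower()
```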
*3. Discriminatory Training*
My next concern relates to the lack of discriminatory training between
categories. I think the way SpamBayes works is that my training on a
particular class, say class 1, builds a model with which to make the
discrimination: is this document instance a member of class 1 or not? When
I train the model for class 1, do I include only positive instances (the
ham) of Category 1? Or do I also include negative instances from the other
four categories (the spam)?
If the model for Category 1 is trained only on positive instances from that
category, then this trained model is independent of the trained models for
categories 2 through 5. And when it comes time to make a classification, the
model that responds "loudest" is the one selected. But, and here's my
concern, there has never been a probability model created that *discriminates
between* the categories. Does what I'm describing make sense? I'm thinking
of Maximum Likelihood training of acoustic models in a speech recognition
system, which has this same lack of discriminatory training, and I'm
wondering whether multi-class naive-Bayes classifiers have the same kind of
limitation.
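To illustrate the scheme I'm asking about, here is a toy one-model-per-category setup where each model is trained only on its own positive examples and the winner is whichever model scores a new document highest. All of the names and the smoothed-unigram scoring are my own invention, not the SpamBayes algorithm:

```python
import math
from collections import Counter

class CategoryModel:
    """Toy unigram model for one category (not the SpamBayes algorithm)."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def learn(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)

    def log_likelihood(self, tokens, vocab_size):
        # Add-one smoothed unigram log-likelihood of the document.
        return sum(
            math.log((self.counts[t] + 1) / (self.total + vocab_size))
            for t in tokens
        )

def pick_category(models, tokens):
    """Pick the category whose model responds "loudest".

    Note: each model was trained independently on its own positives,
    so nothing here ever directly discriminates between categories --
    which is exactly my concern above.
    """
    vocab = set()
    for m in models.values():
        vocab.update(m.counts)
    vocab_size = max(len(vocab), 1)
    return max(models, key=lambda c: models[c].log_likelihood(tokens, vocab_size))
```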
*4. Adding a New Document Category.*
Let's say I have trained the models on my five classes (as described above)
and everything is working fine and I decide to add a new document category.
Do the first five models need to be retrained from scratch (to include the
negative instances from this new sixth category)? Or can SpamBayes models be
"incrementally" trained by just training on these new class-6 negative
instances?
*5. Size of Training Sample*
My final concern relates to the number of training documents I would need.
I'm guessing that each of my documents, no matter how long or short, reduces
to a single feature vector for training and classification. Is this correct?
If so, it would seem that I would need *at least* hundreds of examples from
each category and probably thousands. Yes? No?
Thanks for any thoughts you might have on my concerns and questions.