[Spambayes] How do you classify text?
Tim Stone - Four Stones Expressions
tim at fourstonesExpressions.com
Wed Apr 23 11:15:37 EDT 2003
4/23/2003 4:37:16 AM, Miguel Sevillano <msevilla at gts.tsc.uvigo.es> wrote:
> I'm working on a project that must classify a paragraph as one among
>N subjects. I would like to know exactly how you take a paragraph and
>classify it; how do you train the filter?
> I would like to apply Bayesian rules to distinguish among the N
>different subjects that a paragraph may be talking about.
Spambayes will classify into at most three buckets: positive classification,
negative classification, and unsure. To apply this to n subjects, you'd need
to apply the filter n-1 times. For classifications c(1)...c(n), you would
first apply the filter for c(1), removing all positive c(1) classifications
from your input set. Then filter for c(2), removing all positives, and so on
up to c(n). You may well end up with negative and unsure classifications left
over after the final c(n) filtering. Each of these filters would require its
own Bayesian classification database (PersistentClassifier in spambayes), and
each would have to be trained separately, by feeding known positives to it via
the learn() method. Filtering is done by calling the spamprob() method on a
particular classifier, passing it the text as tokenized by our tokenizer. You
can see a clear example of this training and filtering activity in the
If you don't currently know Python, you might want to get yourself a Python
primer and read it, as there is a bit of advanced Python in this code.
By and large, the code is quite readable, though, so check it out and have a
peek. Again, start at the imapfilter, and don't get hung up on the imap-
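
To make the cascade idea above a bit more concrete, here is a rough,
self-contained sketch. It does not import spambayes: TinyClassifier and
tokenize() are hypothetical stand-ins that only mimic the learn()/spamprob()
shape described above, with a much cruder probability calculation than the
real classifier uses, and the 0.9 cutoff and the sample sentences are made up
for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Stand-in tokenizer: lowercase word tokens (not the spambayes tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyClassifier:
    """Hypothetical two-class classifier with learn() and spamprob() methods."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # docs containing token
        self.ndocs = {True: 0, False: 0}                    # docs seen per class

    def learn(self, tokens, is_positive):
        self.counts[is_positive].update(set(tokens))
        self.ndocs[is_positive] += 1

    def spamprob(self, tokens):
        """Return a score in (0, 1); higher means 'more likely positive'."""
        log_odds = 0.0
        for tok in set(tokens):
            pos, neg = self.counts[True][tok], self.counts[False][tok]
            if pos + neg == 0:
                continue  # tokens never seen in training are treated as neutral
            p_pos = (pos + 1) / (self.ndocs[True] + 2)   # add-one smoothing
            p_neg = (neg + 1) / (self.ndocs[False] + 2)
            log_odds += math.log(p_pos / p_neg)
        log_odds = max(-50.0, min(50.0, log_odds))       # avoid exp() overflow
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_cascade(labelled, subjects):
    """Train one classifier per subject: that subject is positive, the rest negative."""
    cascade = []
    for subject in subjects:
        clf = TinyClassifier()
        for text, label in labelled:
            clf.learn(tokenize(text), label == subject)
        cascade.append((subject, clf))
    return cascade


def classify(text, cascade, positive_cutoff=0.9):
    """Try each filter in order; return the first positive subject, else None."""
    tokens = tokenize(text)
    for subject, clf in cascade:
        if clf.spamprob(tokens) >= positive_cutoff:
            return subject
    return None  # left over as negative/unsure after the last filter


labelled = [
    ("the striker scored a late goal in the match", "sports"),
    ("the goalkeeper saved a penalty in the final", "sports"),
    ("the senate passed the budget bill", "politics"),
    ("the minister defended the election law", "politics"),
    ("the chef baked bread with fresh yeast", "cooking"),
    ("simmer the sauce and season the soup", "cooking"),
]
cascade = train_cascade(labelled, ["sports", "politics", "cooking"])

if __name__ == "__main__":
    for sample in ("a penalty goal decided the match", "quantum physics lecture"):
        print(sample, "->", classify(sample, cascade))
```

Note that the order of the filters matters in a cascade like this: a paragraph
that two filters would both call positive goes to whichever subject is tried
first, and anything no filter is sure about falls out the bottom as None.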
c'est moi - TimS
There are 10 kinds of people in the world:
those who understand binary,
and those who don't.