[spambayes-dev] Non email classification

Tue Jul 15 17:03:53 EDT 2003

[Tony Meyer]
> If one of the maths experts could answer this, that would be great :)
>
> I understand that the tokenizer is designed with the ham/spam
> classification in mind.  Is the classifier likewise designed, or
> should it be good at any binary classification?
>
> In particular, should the current chi2 classifier be a better binary
> classifier than the older spambayes classifiers (gary et al) whether
> classifying ham/spam, or any other pair?

Probably.  If you're using a dict-based classifier, it doesn't even care
whether "the tokens" are strings; it just needs them to be hashable and to
support equality comparison.  (For an example relevant to another current
thread, in one experiment the classifier was fed integer hash codes as
tokens, instead of strings.)
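
A minimal sketch of that point, not the spambayes code: the class name
TinyCounts and its layout are hypothetical, made up for illustration.  A
dict keyed by token only relies on hashing and equality, so strings,
integers and tuples can all serve as tokens.

```python
from collections import defaultdict


class TinyCounts:
    """Hypothetical stand-in for a dict-based classifier's token store."""

    def __init__(self):
        # token -> [ham_count, spam_count]; the key can be any hashable
        # object that supports equality comparison.
        self.counts = defaultdict(lambda: [0, 0])

    def learn(self, tokens, is_spam):
        index = 1 if is_spam else 0
        # Count each distinct token once per message.
        for tok in set(tokens):
            self.counts[tok][index] += 1


store = TinyCounts()
store.learn(["free", "money", "free"], is_spam=True)          # string tokens
store.learn([42, 1234567, ("free", "money")], is_spam=True)   # ints, a tuple
store.learn(["meeting", 42], is_spam=False)
```

Nothing in the store looks inside a token, which is why swapping strings
for integer hash codes needs no change on the classifier side.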

It's *almost* a domain-neutral algorithm for deciding whether a clump of
probabilities is uniformly distributed.  The "almost" is because the
classifier weeds duplicates out of the tokens, reducing the clump to a set
of unique tokens.  Testing showed that duplicate removal was a net win for
the ham-vs-spam task.  It's possible that leaving duplicates in would be
better for some other discrimination task.
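
To make the duplicate question concrete, here is a rough sketch of
chi-squared (Fisher-style) combining with a switch for duplicate removal.
It is an illustration under assumptions, not the actual classifier: the
function names, the `dedupe` flag, the per-token probabilities and the 0.5
default for unknown tokens are all invented for the example.

```python
import math


def chi2Q(x2, v):
    """Survival function of chi-squared with v degrees of freedom (v even)."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(tokens, token_probs, dedupe=True, unknown=0.5):
    """Combine per-token spam probabilities into a score in [0, 1].

    With dedupe=True the token stream is reduced to a set of unique
    tokens first; with dedupe=False every occurrence counts.
    """
    clues = set(tokens) if dedupe else list(tokens)
    probs = [token_probs.get(tok, unknown) for tok in clues]
    n = len(probs)
    if n == 0:
        return 0.5
    # Two tests of "are these probabilities uniformly distributed?",
    # one sensitive to spammy evidence and one to hammy evidence.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0


# Hypothetical per-token probabilities, for illustration only.
token_probs = {"a": 0.9, "b": 0.2}
message = ["a", "a", "a", "b"]

without_dups = combine(message, token_probs, dedupe=True)
with_dups = combine(message, token_probs, dedupe=False)
```

With duplicates kept, the repeated "a" counts three times and pulls the
score further toward that token's side than it does after reduction to a
set.  Whether that helps or hurts is an empirical question for each task.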