[Spambayes] Watch out for this
Anthony Baxter
anthony at interlink.com.au
Thu Sep 11 03:25:32 EDT 2003
>>> Skip Montanaro wrote
> Good suggestion. I'm not sure if the tokenizer does this already, but a
> quick grep for '&#[0-9];' through my current training database (about 3
> million lines) suggests this is still fairly infrequently used. Only
> about 2100 lines (around 0.07%) contained a numeric entity. If/when the
> spammers start using such techniques and they turn out
> to cause problems for the classifier, it should be fairly easy to extend the
> tokenizer to make the necessary substitutions.
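
For reference, the substitution described above could look roughly like
this. It is only a sketch in modern Python, not the actual Spambayes
tokenizer code: the name decode_numeric_entities and the regex are made
up for illustration. The pattern accepts one or more decimal digits and
also hex forms such as &#x56;, so it is broader than the grep pattern
quoted above, which as written only matches single-digit entities.

```python
import re

# Hypothetical helper for illustration; not part of the Spambayes tokenizer.
# Matches decimal (&#105;) and hexadecimal (&#x69;) numeric entities.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text):
    """Replace numeric character entities with the characters they encode."""
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the original text untouched.
            return match.group(0)
    return _NUMERIC_ENTITY.sub(_sub, text)


print(decode_numeric_entities("V&#105;agra"))
```

Running this before tokenising would mean an obfuscated word yields the
same token as its plain spelling. Whether that actually helps the
classifier would still need testing.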
Maybe we should have a file somewhere of "yet to be tested" tokeniser
ideas? And update it with a comment when we find what does or doesn't
work? (Ref the discussion yesterday about tokenising tricks tried and
abandoned...)