[Spambayes] Watch out for this

Anthony Baxter anthony at interlink.com.au
Thu Sep 11 03:25:32 EDT 2003


>>> Skip Montanaro wrote
> Good suggestion.  I'm not sure if the tokenizer does this already, but a
> quick grep for '&#[0-9];' through my current training database (about 3
> million lines) suggests this is still fairly infrequently used.  Only
> about 2100 lines (around 0.07%) contained a numeric
> entity.  If/when the spammers start using such techniques and they turn out
> to cause problems for the classifier, it should be fairly easy to extend the
> tokenizer to make the necessary substitutions.
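
For reference, here is a rough sketch of the kind of substitution Skip
describes: decoding numeric entities back to plain characters before the
text is tokenised. This is not the existing Spambayes tokenizer code; the
names NUMERIC_ENTITY_RE and decode_numeric_entities are made up for
illustration, and it assumes Python 3 strings. Unlike the grep pattern
quoted above, which only matches a single digit, it accepts one or more
decimal digits as well as the hex form.

```python
import re

# Matches decimal (&#105;) and hexadecimal (&#x69;) numeric character references.
NUMERIC_ENTITY_RE = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text):
    """Replace numeric character references with the characters they encode.

    Hypothetical helper for illustration only, not part of Spambayes.
    """
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Code point out of range: leave the original text untouched.
            return match.group(0)

    return NUMERIC_ENTITY_RE.sub(_sub, text)


if __name__ == "__main__":
    print(decode_numeric_entities("V&#105;agra"))
```

Named entities such as &amp; are deliberately left alone here; whether
decoding helps or hurts the classifier would still need to be tested.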

Maybe we should keep a file somewhere of "yet to be tested" tokeniser
ideas, and update it with a comment when we find out what does or doesn't
work? (See yesterday's discussion about tokenising tricks tried and
abandoned...)



