[Spambayes] Strip Subject of Non-alpha
skip at pobox.com
Mon Dec 8 14:56:52 EST 2003
Dennis> I suggest that a filter be added which strips the subject line
Dennis> of all non-alpha characters before scoring. It can still be scored
Dennis> on the unstripped subject, but on the stripped one as well. That
Dennis> will detect messages where the spam words are broken up by dots,
Dennis> periods, dashes, etc.
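
For concreteness, here is a minimal sketch of what Dennis's suggestion might look like. It is not Spambayes code: the function names and the "subject:" and "subject-stripped:" token prefixes are hypothetical, and it assumes whitespace is kept as the word separator, so only punctuation and digits inside a word are removed.

```python
import re

# Anything that is not an ASCII letter counts as "non-alpha" in this sketch.
_NON_ALPHA = re.compile(r"[^A-Za-z]+")


def strip_non_alpha(word):
    """Return word with every non-letter character removed."""
    return _NON_ALPHA.sub("", word)


def subject_tokens(subject):
    """Yield tokens for the unstripped subject, then for the stripped form.

    The token prefixes are made up for illustration; they are not the
    names the Spambayes tokenizer uses.
    """
    words = subject.split()
    for word in words:
        yield "subject:" + word.lower()
    for word in words:
        stripped = strip_non_alpha(word)
        # Emit a second token only when stripping changed the word.
        if stripped and stripped != word:
            yield "subject-stripped:" + stripped.lower()
```

With a subject such as "Get V.I.A-G.R.A now!", the second pass contributes the tokens "subject-stripped:viagra" and "subject-stripped:now" in addition to the ordinary ones, so the classifier sees both forms.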
I have a local mod which adds an asciify_subject option to the tokenizer.
It uses a codec I wrote called 'latscii' which assumes the subject is
encoded as latin-1 (which seems to be the case for all the examples I've
seen) and then performs a mapping from accented to unaccented letters, and
maps symbols to ASCII characters somewhat arbitrarily (e.g., mapping the
registered trademark character to an 'R' and a British pound sign to '#').
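
The kind of mapping described above can be sketched roughly as follows. This is not the actual 'latscii' codec, only an approximation using the standard unicodedata module. The asciify name and the SYMBOL_MAP table are hypothetical, the table holds only the two symbol mappings mentioned above, and replacing anything otherwise unmappable with '?' is an assumption of this sketch.

```python
import unicodedata

# Only the two symbol mappings mentioned in the message; a fuller table
# would be needed in practice.
SYMBOL_MAP = {
    "\u00ae": "R",  # registered trademark sign
    "\u00a3": "#",  # pound sign
}


def asciify(raw_subject):
    """Decode bytes as latin-1 and fold the result down to ASCII."""
    text = raw_subject.decode("latin-1")
    out = []
    for ch in text:
        if ch in SYMBOL_MAP:
            out.append(SYMBOL_MAP[ch])
            continue
        # Split accented letters into base letter plus combining marks,
        # then drop the marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        ascii_part = base.encode("ascii", "ignore").decode("ascii")
        # Assumption: anything still unmappable becomes '?'.
        out.append(ascii_part if ascii_part else "?")
    return "".join(out)
```

The tokenizer option would then just run the subject through something like asciify() before generating tokens.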
I suppose I could check it in, though given the fairly small number of
these sorts of messages I receive, it's not clear that it makes much
difference (though perhaps my code modification has a bug someone else could
spot). I never got overwhelming encouragement for my ideas about how to add
experimental extensions to the CVS repository.