[Spambayes] Strip Subject of Non-alpha

Skip Montanaro skip at pobox.com
Mon Dec 8 14:56:52 EST 2003


    Dennis> I suggest that a filter be added which strips the subject line
    Dennis> of all non-alpha characters before scoring.  It can be scored on
    Dennis> the unstripped subject too, but on the stripped one too.  That
    Dennis> will detect messages where the spam words are broken up by dots,
    Dennis> periods, dashes, etc.
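
A minimal sketch of the suggestion above, in Python.  It is not the
Spambayes tokenizer: the function name and the "subject-stripped:" token
prefix are hypothetical, chosen only for illustration.  It assumes the
subject is already a decoded string, and it emits an extra token only for
words that actually change when non-letters are removed, so the normal
(unstripped) subject tokens can still be scored alongside.

```python
import re

# Anything that is not an ASCII letter counts as "non-alpha" here (assumption).
_NON_ALPHA = re.compile(r"[^A-Za-z]+")


def stripped_subject_tokens(subject):
    """Yield extra tokens for subject words that were broken up by punctuation.

    The "subject-stripped:" prefix is a made-up label, not a Spambayes token.
    """
    for word in subject.split():
        collapsed = _NON_ALPHA.sub("", word).lower()
        # Skip words that are unchanged, and words that were all punctuation.
        if collapsed and collapsed != word.lower():
            yield "subject-stripped:" + collapsed
```

Stripping word by word, rather than over the whole subject, keeps the word
boundaries, so "f-r-e-e" becomes one token instead of merging with its
neighbours.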

I have a local mod which adds an asciify_subject option to the tokenizer.
It uses a codec I wrote called 'latscii' which assumes the subject is
encoded as latin-1 (which seems to be the case for all the examples I've
seen) and then performs a mapping from accented to unaccented letters, and
maps symbols to ASCII characters somewhat arbitrarily (e.g., mapping the
registered trademark character to an 'R' and a British pound sign to '#').
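
As a rough illustration of that kind of mapping (not the latscii codec
itself), here is a sketch in modern Python.  It swaps in Unicode NFKD
decomposition from the standard unicodedata module to drop accents, rather
than a hand-written table, and the SYMBOL_MAP contents and the function body
are assumptions; only the two symbol mappings mentioned above are included.

```python
import unicodedata

# Hand-picked symbol replacements (assumption: only the two examples
# mentioned in the message; a real table would be longer).
SYMBOL_MAP = {
    "\u00ae": "R",  # registered trademark sign
    "\u00a3": "#",  # British pound sign
}


def asciify_subject(raw_subject):
    """Return an ASCII approximation of a subject assumed to be latin-1."""
    if isinstance(raw_subject, bytes):
        text = raw_subject.decode("latin-1")
    else:
        text = raw_subject
    out = []
    for ch in text:
        if ch in SYMBOL_MAP:
            out.append(SYMBOL_MAP[ch])
            continue
        # NFKD splits an accented letter into base letter + combining mark;
        # keeping only the ASCII code points drops the mark.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)
```

Characters with no ASCII decomposition and no entry in SYMBOL_MAP are
silently dropped in this sketch.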

I suppose I could check it in, though given the fairly small number of
these sorts of messages I receive, it's not clear that it makes much
difference (though perhaps my code modification has a bug someone else
could spot).  I never got overwhelming encouragement for my ideas about how
to add experimental extensions to the CVS repository.

Skip


