[spambayes-dev] Interesting unsure

T. Alexander Popiel popiel at wolfskeep.com
Wed Jun 25 15:07:06 EDT 2003


In message:  <16122.2081.275212.576909 at montanaro.dyndns.org>
             Skip Montanaro <skip at pobox.com> writes:

>I got an interesting spam just now.  [...]
>the subject had umlauts over many of the vowels:
>
>    ousp Wänt to makë löve likë a teën?
>
>so of course, I got several tokens which the classifier ignored.

>
>    X-Spambayes-Debug: '*H*': 0.21; '*S*': 0.66; 'doorknob': 0.09;
>        'subject:?': 0.23; 'detect': 0.26; 'header:Message-ID:1': 0.37;
>        'header:Reply-To:1': 0.61; 'url:com': 0.61; 'url:www': 0.67;
>        'header:Received:2': 0.76; 'subject:\xf6': 0.84;
>        'content-type:text/html': 0.87; 'url:gif': 0.93
>    X-Spambayes-Classification: unsure; 0.73
>
>It's not clear much can be done, though it might be interesting to try
>an option to map Latin-1 accented characters to their unadorned ASCII
>counterparts, at least in subjects (strip_subject_accents?).

I suspect that would have serious detrimental effects for foreign
language users.
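
To make the concern concrete, here is a rough sketch of what such a
mapping might look like. The function name strip_accents is made up
for illustration and is not an existing spambayes option or function;
the sketch uses Unicode decomposition from the standard library rather
than a hand-built Latin-1 table.

```python
import unicodedata

def strip_accents(text):
    """Map accented characters to their unaccented base letters.

    Hypothetical helper for illustration only; not part of spambayes.
    """
    # NFKD splits e.g. 'ä' into 'a' plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The spam subject from the quoted message, minus its first word.
print(strip_accents("W\xe4nt to mak\xeb l\xf6ve lik\xeb a te\xebn?"))

# The downside: distinct words in other languages collapse together.
# German "schön" (beautiful) becomes "schon" (already).
print(strip_accents("sch\xf6n") == "schon")
```

The second print shows the collision: once the accents are gone, two
different German words yield the same token, so the classifier loses
information that legitimate non-English mail depends on.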

>The problem with trying such an experiment isn't that it might not be
>worthwhile, but that if it's a new spammer technique, there won't be
>many messages in our existing spam/ham databases which would exercise
>the technique.

I don't see this as any different from any of the other neologisms
that spammers come up with; if they persist in using such words
(and you're still training), then the odd words with accents will
quickly become strong spam indicators.  No need for us to do anything...
it's already going to be handled properly.
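
As a back-of-the-envelope illustration of "quickly", here is a sketch
of a Robinson-style smoothed token probability. The function name, the
corpus sizes, and the parameter values s=0.45 and x=0.5 are my
assumptions for the example, not something taken from the classifier
source.

```python
def smoothed_spam_prob(spam_hits, ham_hits, nspam, nham, s=0.45, x=0.5):
    """Robinson-style smoothed spam probability for a single token.

    s (strength of the prior) and x (prior probability for an unseen
    token) are assumed values chosen for this sketch.
    """
    n = spam_hits + ham_hits
    if n == 0:
        # Never-seen token: fall back to the prior.
        return x
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    # Raw estimate from the training counts alone.
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Blend the raw estimate with the prior, weighted by the evidence n.
    return (s * x + n * p) / (s + n)

# A token seen only in spam, after 1, 2, 5 and 10 trained spam messages,
# with a made-up corpus of 1000 spam and 1000 ham.
for k in (1, 2, 5, 10):
    print(k, round(smoothed_spam_prob(k, 0, 1000, 1000), 2))
```

Under these assumptions a token seen in one spam and no ham already
scores about 0.84, which is consistent with the 'subject:\xf6' clue in
the quoted debug header, and a handful of further sightings push it
above 0.95.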

- Alex


