[Spambayes] idea for tokenizer.crack_filename change

Greg Ward gward@python.net
Thu, 19 Sep 2002 09:04:24 -0400


On 18 September 2002, Neale Pickett said:
> In going over some of my spam, I was surprised to see that the following
> wasn't penalized:
> 
>   ------=_NextPart_000_0039_0173A692.99A692D0
>   Content-Type: application/octet-stream; name="Video.pif"
>   Content-Transfer-Encoding: base64
>   Content-Disposition: attachment; filename="Video.pif"

That's almost certainly a virus.  Look at the base64-decoded body --
I betcha it starts with "MZ", the signature of a DOS/Windows executable.
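
A rough sketch of that check, in case anyone wants to try it on their own
corpus.  The function name is made up for illustration, and it assumes you
have already pulled the base64 text of the attachment out of the message:

```python
import base64


def looks_like_dos_executable(b64_body):
    """Return True if the decoded payload starts with the bytes 'MZ'.

    Hypothetical helper, not taken from any existing tool.
    """
    # Strip whitespace and line breaks; four base64 characters decode to
    # three bytes, which is enough to see the two-byte signature.
    head = "".join(b64_body.split())[:4]
    if len(head) < 4:
        return False
    try:
        decoded = base64.b64decode(head)
    except ValueError:
        # binascii.Error is a subclass of ValueError; treat anything that
        # will not decode as "not an executable".
        return False
    return decoded[:2] == b"MZ"
```

For instance, a body beginning with "TVqQ" decodes to "MZ" followed by a
0x90 byte, so the function returns True for it.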

Spam detectors should not be distracted by trying to detect viruses.
It's OK if a spam detector happens to catch some viruses, but -- apart
from some similarities in the headers and the MIME structure -- spam
and viruses tend to be very different beasts.  My gut instinct says
viruses should not be included in the "spam" corpus for training a spam
detector -- too many paths to follow, too ambiguous, or something like
that.

However, detecting viruses is pretty damn easy.  See

  http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/elspy/elspy/lib/execontent_simple.py?rev=1.6&content-type=text/vnd.viewcvs-markup

for the code used on mail.python.org and starship.python.net.  In a
nutshell, this code scans the message body for

  /^content-(type|disposition):.*(file)?name=.*\.(\w+)/

and, if the extension that matches that final \w+ is one of
("exe", "com", "vbs", "pif", ...), then you reject the message.
Shelling out big bucks for a virus scanner is kinda silly if
you ask me.
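
Here is a minimal sketch of that idea.  It is not the elspy code itself:
the names are invented for illustration, the case-insensitive and
multi-line regex flags are my assumption, and the extension set holds
only the four extensions named above (the real list is longer).

```python
import re

# Only the four extensions named in the text; the full list is longer
# and is deliberately not guessed at here.
BAD_EXTENSIONS = {"exe", "com", "vbs", "pif"}

# The pattern quoted above.  IGNORECASE and MULTILINE are assumptions so
# that it matches "Content-Type:" at the start of any line.
ATTACHMENT_RE = re.compile(
    r"^content-(type|disposition):.*(file)?name=.*\.(\w+)",
    re.IGNORECASE | re.MULTILINE,
)


def has_executable_attachment(message_text):
    """Return True if any MIME header names a file with a bad extension."""
    for match in ATTACHMENT_RE.finditer(message_text):
        # Group 3 is the final \w+, i.e. the text after the last dot.
        if match.group(3).lower() in BAD_EXTENSIONS:
            return True
    return False
```

Fed the MIME part Neale quoted, this matches on the Content-Type line
with "pif" as the extension and returns True, so the message would be
rejected.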

Back to our regularly scheduled programming...

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
Rules for Urban Cycling, #1:
Green means go; yellow means go like hell; red means proceed with caution.