[Spambayes] idea for tokenizer.crack_filename change
Greg Ward
gward@python.net
Thu, 19 Sep 2002 09:04:24 -0400
On 18 September 2002, Neale Pickett said:
> In going over some of my spam, I was surprised to see that the following
> wasn't penalized:
>
> ------=_NextPart_000_0039_0173A692.99A692D0
> Content-Type: application/octet-stream; name="Video.pif"
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment; filename="Video.pif"
That's almost certainly a virus. Look at the base64-decoded body --
I betcha it starts with "MZ", the signature of a DOS/Windows executable.
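For anyone who wants to check that by hand, here is a minimal Python sketch. The function name looks_like_dos_executable is made up for illustration and is not part of any existing tool; it assumes you have already pulled the base64 text of the attachment out of the message.

```python
import base64


def looks_like_dos_executable(b64_body: str) -> bool:
    """Return True if the base64-decoded payload begins with the bytes "MZ".

    Hypothetical helper for illustration only.
    """
    # Four base64 characters decode to three bytes, so the first four
    # non-whitespace characters are enough to recover the first two bytes.
    head = "".join(b64_body.split())[:4]
    try:
        decoded = base64.b64decode(head)
    except ValueError:
        # Too short or malformed to decode; treat it as "not an executable".
        return False
    return decoded.startswith(b"MZ")
```

Only the first few characters are decoded, so this stays cheap even for a large attachment. It checks the signature and nothing else, so it says nothing about whether the file is actually malicious.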
Spam detectors should not be distracted by trying to detect viruses.
It's OK if a spam detector happens to catch some viruses, but -- apart
from some similarities in the headers and the MIME structure -- spam
and viruses tend to be very different beasts. My gut instinct says
viruses should not be included in the "spam" corpus for training a spam
detector -- too many paths to follow, too ambiguous, or something like
that.
However, detecting viruses is pretty damn easy. See
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/elspy/elspy/lib/execontent_simple.py?rev=1.6&content-type=text/vnd.viewcvs-markup
for the code used on mail.python.org and starship.python.net. In a
nutshell, this code scans the message body for
/^content-(type|disposition):.*(file)?name=.*\.(\w+)/
and, if the extension that matches that final \w+ is one of
("exe", "com", "vbs", "pif", ...), then you reject the message.
Shelling out big bucks for a virus scanner is kinda silly if
you ask me.
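As a rough sketch of the scan described above (not the actual execontent_simple.py code), something like the following would do. The names ATTACHMENT_NAME_RE, BLOCKED_EXTENSIONS and find_blocked_extension are invented for this example, case-insensitive matching is an assumption, and the extension set holds only the four named above, not the full list.

```python
import re

# The pattern quoted above. Matching case-insensitively and line by line
# is an assumption made for this sketch.
ATTACHMENT_NAME_RE = re.compile(
    r"^content-(type|disposition):.*(file)?name=.*\.(\w+)",
    re.IGNORECASE | re.MULTILINE,
)

# Only the four extensions named in this message; the real list is longer.
BLOCKED_EXTENSIONS = {"exe", "com", "vbs", "pif"}


def find_blocked_extension(message_text):
    """Return the first blocked attachment extension found, or None."""
    for match in ATTACHMENT_NAME_RE.finditer(message_text):
        # Group 3 is the final \w+, i.e. the text after the last dot.
        extension = match.group(3).lower()
        if extension in BLOCKED_EXTENSIONS:
            return extension
    return None
```

Run against the MIME part headers quoted at the top of this message, find_blocked_extension() should return "pif"; a caller would then reject the message whenever the result is not None.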
Back to our regularly scheduled programming...
Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
Rules for Urban Cycling, #1:
Green means go; yellow means go like hell; red means proceed with caution.