[Spambayes] Ideas for an MSc project please...

Ryan Malayter rmalayter at bai.org
Wed Feb 4 17:28:48 EST 2004

[Skip Montanaro]
>>1) using Bayesian-like statistics to evaluate code 
>>   for virus-like behavior. 
> SpamBayes actually already does a pretty good job of this, 
> assuming viruses get that far within your email infrastructure.

Spambayes evaluates the e-mail messages that viruses send, not the viral
code itself. I was thinking about going further down, into the innards
of the actual binary (or script) code of the virus. Parsing would be a
challenge, of course, and probably language and platform dependent. But
most viral code has random IP address generators, SMTP engines, backdoor
programs, etc., so the samples must look fairly similar at some level,
even if only at the assembler level. Scripting languages would be much easier to
parse than assembler, of course. Bayesian analysis might help identify
such code.
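
A rough sketch of what that could look like, assuming we skip real parsing
and simply treat overlapping byte n-grams of the file as "tokens". This is
plain naive Bayes with add-one smoothing and equal priors, not the
chi-squared combining Spambayes uses, and the names (NgramBayes, ngrams,
the "bad"/"good" labels, n=4) are made up for illustration:

```python
import math
from collections import Counter


def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of overlapping n-byte substrings found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class NgramBayes:
    """Toy naive-Bayes classifier over byte n-grams (illustrative only)."""

    def __init__(self, n: int = 4):
        self.n = n
        # Number of training samples per label that contained each token.
        self.counts = {"bad": Counter(), "good": Counter()}
        # Number of training samples seen per label.
        self.totals = {"bad": 0, "good": 0}

    def train(self, data: bytes, label: str) -> None:
        self.totals[label] += 1
        self.counts[label].update(ngrams(data, self.n))

    def score(self, data: bytes) -> float:
        """Estimate P(bad | data), assuming independent tokens and equal priors."""
        log_odds = 0.0
        for tok in ngrams(data, self.n):
            # Add-one smoothing so unseen tokens contribute a neutral ratio.
            p_bad = (self.counts["bad"][tok] + 1) / (self.totals["bad"] + 2)
            p_good = (self.counts["good"][tok] + 1) / (self.totals["good"] + 2)
            log_odds += math.log(p_bad / p_good)
        # Clamp before exponentiating to avoid floating-point overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Training would mean calling train() on known-bad and known-clean files and
then thresholding score() on new ones. Raw n-grams would obviously be
defeated by packing or encryption, which is where the parsing problem
comes back in.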

Incidentally, we had about 2% of our users get tagged by MyDoom, even
though we block all executable attachments at the gateway. Apparently,
the rules in our AV software that apply to blocking file extensions do
not apply inside ZIP files, even though the product scans for viruses
inside ZIP files. We started blocking all ZIPs as soon as we heard
about MyDoom, but a few people actually opened the ZIP and the EXE
inside despite all our efforts at education. We're now quarantining all
ZIPs and screening them by hand while we test an update from the AV
vendor. Argh.
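
For what it's worth, applying the extension rule to the archive's member
names is not much code. A minimal sketch using Python's standard zipfile
module; the block list and the function name blocked_members are
hypothetical, and this is not how our AV product works internally:

```python
import io
import os
import zipfile

# Hypothetical block list; a real gateway policy would be longer.
BLOCKED_EXTENSIONS = {".exe", ".scr", ".pif", ".com", ".bat", ".cmd", ".vbs"}


def blocked_members(zip_bytes: bytes) -> list:
    """Return the names of archive members whose extension is blocked."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            name
            for name in zf.namelist()
            # Compare case-insensitively; only the final extension counts,
            # so "report.txt   .EXE" is still treated as an .exe.
            if os.path.splitext(name.lower())[1] in BLOCKED_EXTENSIONS
        ]
```

A non-empty result would mean the attachment gets quarantined. The sketch
does not look inside nested archives, and zipfile.ZipFile raises
zipfile.BadZipFile if the attachment is not actually a ZIP.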

Anyway, a system that could accurately evaluate the binary code for
virus-like characteristics might have caught MyDoom in this instance by
spotting the signs of an SMTP engine, registry edits to the RUN keys, etc.
In my experience, the current "heuristic" scanners in most AV products
suck and never detect anything that lacks a virus signature.

Parsing "polymorphic" and encrypted viruses might be more difficult, but
presumably there's a body of knowledge about this out there somewhere
(AV vendors do it already for cleaning).


