[Spambayes] python.org corpus updated
Tim Peters
tim.one@comcast.net
Mon Oct 28 19:03:08 2002
[Jim Bublitz]
> Works great for me - I put all tagged/scrubbed or virgin virus msgs
> in my spam corpus from the start and haven't had a problem. I don't
> virus scan (Linux) but some of my ISPs do. The email module has some
> problems with them though, because some of the virus taggers mung
> the boundaries or attachments.
>
> Viruses look like spam to me.

How do you tokenize? We ignore MIME sections that aren't text/*, except for
generating metatokens from the MIME armor (content-type,
content-disposition, charset and filename parameter values). There's
another option to suck up the first 5 decoded bytes of octet-stream
sections, but enabling that hasn't made any difference in my tests.
IOW, a typical virus generates a very small set of tokens, the way we
tokenize. We're also missing src=cid: clues from iframe tags.
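
For concreteness, here is a rough sketch of that kind of tokenizing, using only the standard email package. It is not the actual Spambayes tokenizer: the function names, the token prefixes and the octet_prefix_size parameter are all made up for illustration, and the iframe check is just one guess at how the missing src=cid: clue might be picked up.

```python
import re
from email import message_from_string

# Hypothetical pattern for an <iframe ... src=cid:...> clue (illustration only).
IFRAME_CID = re.compile(r"<iframe\b[^>]*\bsrc\s*=\s*[\"']?cid:", re.IGNORECASE)


def mime_metatokens(msg, octet_prefix_size=0):
    """Yield tokens from the MIME armor of each leaf part.

    Token prefixes are invented for this sketch, not Spambayes' own.
    """
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body of their own
        ctype = part.get_content_type()  # already lowercased
        yield "content-type:" + ctype
        disp = part.get("Content-Disposition")
        if disp:
            # Keep only the disposition value, e.g. "attachment".
            yield "content-disposition:" + disp.split(";")[0].strip().lower()
        charset = part.get_content_charset()
        if charset:
            yield "charset:" + charset
        fname = part.get_filename()
        if fname:
            yield "filename:" + fname.lower()
        # Optional: the first few decoded bytes of octet-stream sections.
        if octet_prefix_size and ctype == "application/octet-stream":
            payload = part.get_payload(decode=True) or b""
            yield "octet:" + payload[:octet_prefix_size].hex()


def iframe_cid_tokens(html):
    """Yield one token if the HTML has an iframe whose src is a cid: URL."""
    if IFRAME_CID.search(html):
        yield "iframe:src=cid"


def tokenize_armor(raw_text, octet_prefix_size=0):
    """Parse a raw message and collect armor tokens plus the iframe clue."""
    msg = message_from_string(raw_text)
    tokens = list(mime_metatokens(msg, octet_prefix_size))
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            body = part.get_payload(decode=True) or b""
            tokens.extend(iframe_cid_tokens(body.decode("latin-1")))
    return tokens
```

Run over a typical virus (a short HTML part plus one base64 attachment), this yields only a handful of tokens, which is the point made above: the attachment body itself contributes nothing unless the octet-stream option is turned on.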