[Spambayes] python.org corpus updated

Tim Peters tim.one@comcast.net
Mon Oct 28 19:03:08 2002


[Jim Bublitz]
> Works great for me - I put all tagged/scrubbed or virgin virus msgs
> in my spam corpus from the start and haven't had a problem. I don't
> virus scan (Linux) but some of my ISPs do. The email module has some
> problems with them though, because some of the virus taggers mung
> the boundaries or attachments.
>
> Viruses look like spam to me.

How do you tokenize?  We ignore MIME sections that aren't text/*, except for
generating metatokens from the MIME armor (content-type,
content-disposition, charset and filename parameter values).  There's
another option to suck up the first 5 decoded bytes of octet-stream
sections, but enabling that hasn't made any difference in my tests.
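
A rough sketch of the scheme just described, using the standard email
package. This is not the actual spambayes tokenizer: the function name
armor_tokens, the octet_prefix parameter, the token spellings and the sample
message are all made up for illustration, and the whitespace split stands in
for real body tokenization.

```python
from email import message_from_string

def armor_tokens(msg, octet_prefix=0):
    """Yield MIME-armor metatokens for every leaf part of msg.

    Hypothetical helper: body words are produced only for text/* parts.
    If octet_prefix > 0, that many decoded bytes of each
    application/octet-stream part are emitted as one hex token.
    """
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        ctype = part.get_content_type()
        yield "content-type:" + ctype
        disp = part.get("Content-Disposition")
        if disp:
            # keep only the disposition value, not its parameters
            yield "content-disposition:" + disp.split(";")[0].strip().lower()
        charset = part.get_content_charset()
        if charset:
            yield "charset:" + charset
        fname = part.get_filename()
        if fname:
            yield "filename:" + fname.lower()
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            # latin-1 never fails to decode; good enough for a sketch
            for word in payload.decode("latin-1").split():
                yield word.lower()
        elif ctype == "application/octet-stream" and octet_prefix > 0:
            payload = part.get_payload(decode=True) or b""
            yield "octet:" + payload[:octet_prefix].hex()

# Invented sample: one text part plus a base64 attachment.
RAW = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "Hello there\n"
    "--XX\n"
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    'Content-Disposition: attachment; filename="README.EXE"\n'
    "\n"
    "TVqQAAMAAAA=\n"
    "--XX--\n"
)

tokens = list(armor_tokens(message_from_string(RAW), octet_prefix=5))
```

For a message that is nothing but an attachment, the token list is only a
handful of entries long.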

IOW, a typical virus generates a very small set of tokens, the way we
tokenize.  We're also missing src=cid: clues from iframe tags.
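
One way the missing iframe clue could be picked up, sketched as a regular
expression over decoded HTML. Nothing like this is in the tokenizer as
described above; the names IFRAME_CID and iframe_cid_tokens and the token
spelling "iframe:src=cid" are invented here.

```python
import re

# Matches an <iframe ...> tag whose src attribute starts with "cid:",
# i.e. one that points at another part of the same message.
IFRAME_CID = re.compile(r"<iframe\b[^>]*\bsrc\s*=\s*[\"']?cid:", re.IGNORECASE)

def iframe_cid_tokens(html):
    """Return a one-element token list if html has an iframe with src=cid:."""
    if IFRAME_CID.search(html):
        return ["iframe:src=cid"]
    return []

hit = iframe_cid_tokens('<IFRAME src="cid:abc123" height=0 width=0></IFRAME>')
miss = iframe_cid_tokens('<iframe src="http://example.com/"></iframe>')
```

A regex is a deliberately crude choice; a real HTML parser would cope better
with odd attribute quoting, at some cost in speed.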