[spambayes-dev] saving attachments
tim.one at comcast.net
Mon Mar 8 15:10:37 EST 2004
> I have been accumulating a message corpus for testing that is now
> becoming alarmingly large. My cup doth runneth over. AFAIK,
> SpamBayes does nothing with attachments. Neither the existence of
> one nor its name, size or contents are considered.
That's unique to the Outlook addin, and happens because Outlook destroys the
original MIME structure. In other ways of using spambayes, all and only
attachments of MIME type text/* are tokenized, and tokens are synthesized
for all MIME sections, recording (from a comment in tokenizer.py):
# Generate tokens for:
#    Content-Type
#        and its type= param
#    Content-Disposition
#        and its filename= param
#    all the charsets
#
# This has huge benefit for the f-n rate, and virtually no effect on
# the f-p rate, although it does reduce the variance of the f-p rate
# across different training sets (really marginal msgs, like a brief
# HTML msg saying just "unsubscribe me", are almost always tagged as
# spam now; before they were right on the edge, and now the
# multipart/alternative pushes them over it more consistently).
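
As a rough illustration of the kind of synthesized tokens that comment
describes, here is a minimal sketch using the standard library email package.
It is not the code in tokenizer.py: the function name and the token prefixes
("content-type:", "filename:", and so on) are assumptions made for this
example.

```python
import email
from email.utils import collapse_rfc2231_value


def mime_structure_tokens(msg):
    """Yield illustrative tokens describing each MIME section of msg.

    The token prefixes are hypothetical; they are not claimed to match
    what spambayes actually generates.
    """
    for part in msg.walk():
        # Content-Type of this section, e.g. "text/plain".
        yield "content-type:" + part.get_content_type()

        # type= parameter of Content-Type, if present.
        type_param = part.get_param("type")
        if type_param:
            yield "content-type/type:" + collapse_rfc2231_value(type_param).lower()

        # filename= parameter of Content-Disposition, if present.
        # (get_filename() falls back to the name= param of Content-Type.)
        filename = part.get_filename()
        if filename:
            yield "filename:" + filename.lower()

        # Charset declared by this section, if any.
        charset = part.get_content_charset()
        if charset:
            yield "charset:" + charset
```

Note that nothing here looks at the payload of a section, only at its headers,
so a large binary attachment contributes just a handful of tokens.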
> While most of the spam in my corpus is attachment-free, the ham has
> lots of them and many are quite large (engineering drawing packages
> for review).
They wouldn't have MIME type text/*, so only the synthesized tokens above
would be generated for them.
> It would reduce the size of the corpus .pst file considerably if I
> could delete all attachments. I have an inexpensive commercial tool
> that can do this, however, I don't want to if anyone is considering
> using attachments in future versions.
> FWIW, I don't see attachments as having much potential for spam
Earlier tests showed that their MIME types, file names, and charsets did help.
> The number of tokens could easily dwarf the original message and need
> not be related to it in any way. The last thing we want to do is to
> encourage spammers to tack on huge attachments,
They won't -- bandwidth is a primary cost for bulk emailers, and big
messages limit the rate at which they can send spam out.
> though word salad attacks have been totally ineffective on my machine
> and most others who mentioned it on this list. However, including
> the full text of actual natural language works might have better
> luck, and I wouldn't want to be responsible for encouraging that
> practice, i.e. really bad Karma, hate mail and death threats, so I
> would think that continuing to ignore attachments is a good strategy.
The Outlook addin ignores them only because nobody has endured the pain
necessary to try to guess what the original MIME structure might have been.
> So, have there been any rumblings about possibly using attachment
> information? Am I reasonably safe in deleting all the attachments in
> my message corpus for the foreseeable future?
No way of using spambayes makes any use of the *content* of non-text/*
attachments, so you're certainly safe deleting those. Big attachments very
rarely have a text type, so that would save the bulk of the space.
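
For a corpus stored as ordinary RFC 2822 message files (not an Outlook .pst,
which would need a different tool), a minimal sketch of that kind of pruning
might look like the following. The function name is hypothetical, and the
approach of emptying the payload while keeping the section headers is an
assumption chosen so that MIME types, file names, and charsets survive.

```python
import email


def drop_non_text_payloads(msg):
    """Empty the body of every non-text/* leaf section, keeping its headers.

    Illustrative helper only; it modifies msg in place and returns it.
    """
    for part in msg.walk():
        if part.is_multipart():
            # Containers hold other sections, not content of their own.
            continue
        if part.get_content_maintype() != "text":
            # Discard the (possibly large) content but leave Content-Type
            # and Content-Disposition alone.
            part.set_payload("")
    return msg
```

After pruning, msg.as_string() gives the slimmed-down message text, which can
be written back to the corpus file.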