
Hi, I'm interested in contributing to GBayes. I'm thinking of trying word stemming and adding other types of token indicators. How can I contribute?
Btw, I have been saving up my spam for a year or so. I have about 31,238 spam messages saved up now. These are categorized as spam based on my reading of the subject, or examining the body when in doubt. There are probably 10% dups in the corpus. Some of them have viruses, likely Klez.
I'd like to replicate Tim's test rig so I can compare my results with existing ones. My spam isn't in mbox format, but I can convert it.
I'm particularly interested in how to allow html only messages (reduce false positives). I'm getting a lot of personal mail in that format, unfortunately.
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements

I'm interested in contributing to GBayes ..
I'm thinking of trying word stemming and adding other types of token indicators. How can I contribute?
Pretty soon, a SF project will be created (Barry has already gotten the request in). We'll gladly add you to the list of developers.
Btw, I have been saving up my spam for a year or so.. I have about 31,238 spam messages saved up now. These are categorized as spam based on my reading of the subject, or examining the body when in doubt. There are probably 10% dups in the corpus. Some of them have viruses, likely klez.
Cool.
I'd like to replicate Tim's test rig so I can compare my results with existing ones. My spam isn't in mbox format, but I can convert it..
If you can't wait for the SF project, you can find all the code in the Python CVS tree: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox...
I'm particularly interested in how to allow html only messages (reduce false positives). I'm getting a lot of personal mail in that format, unfortunately.
You train it with an equal number of spam and non-spam ("ham") that you received. Just make sure the ham training messages contain enough representatives of the html-only mail.
--Guido van Rossum (home page: http://www.python.org/~guido/)

I would like to be in on that project too please.
David LeBlanc
Seattle, WA USA
-----Original Message-----
From: python-dev-admin@python.org [mailto:python-dev-admin@python.org] On Behalf Of Guido van Rossum
Sent: Wednesday, September 04, 2002 17:24
To: bkc@murkworks.com
Cc: python-dev@python.org
Subject: Re: [Python-Dev] Getting started with GBayes testing
I'm interested in contributing to GBayes ..
I'm thinking of trying word stemming and adding other types of token indicators. How can I contribute?
Pretty soon, a SF project will be created (Barry has already gotten the request in). We'll gladly add you to the list of developers.
Btw, I have been saving up my spam for a year or so.. I have about 31,238 spam messages saved up now. These are categorized as spam based on my reading of the subject, or examining the body when in doubt. There are probably 10% dups in the corpus. Some of them have viruses, likely klez.
Cool.
I'd like to replicate Tim's test rig so I can compare my results with existing ones. My spam isn't in mbox format, but I can convert it..
If you can't wait for the SF project, you can find all the code in the Python CVS tree:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/spambayes/
I'm particularly interested in how to allow html only messages (reduce false positives). I'm getting a lot of personal mail in that format, unfortunately.
You train it with an equal number of spam and non-spam ("ham") that you received. Just make sure the ham training messages contain enough representatives of the html-only mail.
--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido addressed most points, so I'll just cover a few:
[Brad Clements]
... I'd like to replicate Tim's test rig so I can compare my results with existing ones. My spam isn't in mbox format, but I can convert it.
Mine isn't either <wink>. Barry gave me mboxes, but the spam corpus I got off the web had one spam per file, and it only took two days of extreme pain to realize that one msg per file is enormously easier to work with when testing: you want to split these at random into random collections, you may need to replace some at random when testing reveals spam mistakenly called ham (and vice versa), etc -- even pasting examples into email is much easier when it's one msg per file (and the test driver makes it easy to print a msg's file path).
My test driver and tokenizer are checked in (timtest.py), and also a little utility or two. The directory structure under my spambayes directory looks like so:
    Data/
        Spam/
            Set1/ (contains 2750 spam .txt files)
            Set2/            ""
            Set3/            ""
            Set4/            ""
            Set5/            ""
        Ham/
            Set1/ (contains 4000 ham .txt files)
            Set2/            ""
            Set3/            ""
            Set4/            ""
            Set5/            ""
            reservoir/ (contains "backup ham")
If you use the same names and structure, huge mounds of the tedious testing code will work as-is. The more Set directories the merrier, although you'll hit a point of diminishing returns if you exceed 10. The "reservoir" directory contains a few thousand other random hams. When a ham is found that's really spam, I delete it, and then the rebal.py utility moves in a message at random from the reservoir to replace it. If I had it to do over again, I think I'd move such spam into a Spam set (chosen at random), instead of deleting it.
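A minimal sketch of the random split and reservoir top-up described above, assuming one message per file. The function names `split_into_sets` and `refill_from_reservoir` are hypothetical, and this is not the code in rebal.py or timtest.py; it only illustrates the idea.

```python
import random
import shutil
from pathlib import Path


def split_into_sets(src_dir, dest_dir, nsets=5, seed=None):
    """Move every file in src_dir into dest_dir/Set1..SetN at random."""
    rng = random.Random(seed)
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    rng.shuffle(files)
    set_dirs = [Path(dest_dir) / f"Set{i}" for i in range(1, nsets + 1)]
    for d in set_dirs:
        d.mkdir(parents=True, exist_ok=True)
    # Deal the shuffled files round-robin so set sizes differ by at most one.
    for i, path in enumerate(files):
        shutil.move(str(path), str(set_dirs[i % nsets] / path.name))
    return set_dirs


def refill_from_reservoir(reservoir_dir, set_dir, rng=random):
    """Move one randomly chosen reservoir message into set_dir."""
    candidates = sorted(p for p in Path(reservoir_dir).iterdir() if p.is_file())
    if not candidates:
        raise RuntimeError("reservoir is empty")
    chosen = rng.choice(candidates)
    target = Path(set_dir) / chosen.name
    shutil.move(str(chosen), str(target))
    return target
```

Something like `split_into_sets("incoming_spam", "Data/Spam")` would produce the Set1..Set5 layout shown above; the directory name `incoming_spam` is only a placeholder.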
I'm particularly interested in how to allow html only messages (reduce false positives). I'm getting a lot of personal mail in that format, unfortunately.
It will learn about that -- not a problem. It's a problem in *my* tests because HTML mail is so strongly hated on tech lists, but newbies use it there anyway, and it would be horrid to block newbies just because they're normal people who enjoy creating visually attractive messages <0.9 wink>.
Read the "What about HTML?" section in timtest.py. You may also wish to remove the guard from
    if part.get_content_type() == "text/plain":
        text = html_re.sub(' ', text)
in tokenize(). Once you have a good test setup, you can try it both ways, and the data will tell you which way works best for your normal mix.
Details of runs both ways on my c.l.py corpora are given in the "What about HTML?" section mentioned before, and even there stripping HTML decorations out of HTML-only messages had an insignificant effect on the f-p rate. It increased the f-n rate, though, and precisely because HTML messages are so very rare on c.l.py that they're *almost* certainly spam.
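To make the two variants concrete, here is a small sketch of the guarded and unguarded tag stripping. The pattern bound to `html_re` below is an assumed stand-in (the real one in timtest.py may differ), and `clean_part_text` and its `guard` parameter are hypothetical names.

```python
import re

# Assumed stand-in for timtest.py's html_re: anything between < and >.
html_re = re.compile(r"<[^>]*>")


def clean_part_text(content_type, text, guard=True):
    """Strip tags from text/plain only (guard=True) or from every part."""
    if not guard or content_type == "text/plain":
        text = html_re.sub(" ", text)
    return text
```

With the guard in place, an HTML-only part keeps its markup, so tokens such as color codes stay visible to the classifier; with `guard=False` they are removed before tokenizing.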

On 4 Sep 2002 at 20:24, Guido van Rossum wrote:
Pretty soon, a SF project will be created (Barry has already gotten the request in). We'll gladly add you to the list of developers.
I look forward to it.
I'm particularly interested in how to allow html only messages (reduce false positives). I'm getting a lot of personal mail in that format, unfortunately.
You train it with an equal number of spam and non-spam ("ham") that you received. Just make sure the ham training messages contain enough representatives of the html-only mail.
This is one way to do it, but I was planning on experimenting with tokenizer methods that strip out HTML tags, leaving only the text.
My feeling is that the presentation of "the message" is independent of the message itself, so if I get a message in Text, HTML, RTF only the actual content is important, not the markup method. Though I suppose using lots of red and large fonts might be an indicator of spam, the text of the message should still suffice.
Tim's comments in timtest.py hint that stripping tags isn't a catastrophe for f-n's, but he's not planning on doing that for use on technical lists. I would like to pursue general client-side filtering of spam, so I do need to contend with that.
btw, Tim's comment:
# So if a message is multipart/alternative with both text/plain and text/html
# branches, we ignore the latter, else newbies would never get a message
# through. If a message is just HTML, it has virtually no chance of getting
# through
Tells me (spammer hat on) that I can send a message with a non-spammish text-only part and a spam HTML part, since most "non-techie" email client users automatically display the HTML version when available; however, Tim's implementation will ignore it.
Most "average users" never even see the text-only part of multipart messages. In Tim's application, that's okay since he's going to use the text-only part anyway. But for my purposes, I need to consider both portions. So it's simpler for me to strip html and combine that text with the text-only part and then "test" the combined parts.
Well these are just musings, I'll be looking for the SF project.
-Brad
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements
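A rough sketch of the "strip html and combine" idea using the standard `email` package. The tag pattern `TAG_RE` and the function name `combined_text` are assumptions made for illustration, and a real tokenizer would need more careful charset and entity handling.

```python
import re
from email import message_from_string

# Crude tag stripper, assumed for illustration only.
TAG_RE = re.compile(r"<[^>]*>")


def combined_text(msg):
    """Join the text/plain parts with the tag-stripped text/html parts."""
    pieces = []
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue  # skips multipart containers and attachments
        raw = part.get_payload(decode=True)
        if raw is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = raw.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            text = raw.decode("latin-1")
        if ctype == "text/html":
            text = TAG_RE.sub(" ", text)
        pieces.append(text)
    return "\n".join(pieces)
```

Feeding the result to the classifier means a bland text/plain branch can no longer hide what the HTML branch says.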

"Brad Clements" wrote This is one way to do it, but I was planning on experimenting with tokenizer methods that strip out HTML tags, leaving only the text.
For the set I'm working with, I found I needed to strip out everything except the src="" and href="" attributes of tags. Too much goodness in them for the system to get its teeth into.
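One way this could look with the standard library's `html.parser`, as a sketch: drop the markup but keep the visible text and the values of `src` and `href` attributes. The names `LinkAttrCollector` and `strip_but_keep_links` are hypothetical, and this is not Anthony's actual code.

```python
from html.parser import HTMLParser


class LinkAttrCollector(HTMLParser):
    """Collect visible text plus the values of src= and href= attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before calling this.
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)

    def handle_data(self, data):
        self.text.append(data)


def strip_but_keep_links(html):
    """Return (text, urls) for an HTML string."""
    parser = LinkAttrCollector()
    parser.feed(html)
    parser.close()
    return " ".join(self.strip() for self in parser.text if self.strip()), parser.urls
```

The URLs can then be tokenized separately from the body text, so that host names and paths become clues of their own.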
Tells me (spammer hat on) that I can send a message with a non-spammish text-only part and a spam HTML part, since most "non-techie" email client users automatically display the HTML version when available; however, Tim's implementation will ignore it.
I've actually got a bunch of spam like that. The text/plain is something like
**This is a HTML message**
and nothing else.
Anthony
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.

Brad> My feeling is that the presentation of "the message" is
Brad> independent of the message itself, so if I get a message in Text,
Brad> HTML, RTF only the actual content is important, not the markup
Brad> method. Though I suppose using lots of red and large fonts might
Brad> be an indicator of spam, the text of the message should still
Brad> suffice.
You might be surprised. In Paul Graham's "A Plan for Spam" he writes:
    I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.
As Tim has pointed out several times, intuition and hunches about this stuff often turn out to be incorrect.
Skip

[Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ]
[Anthony Baxter]
... I've actually got a bunch of spam like that. The text/plain is something like
**This is a HTML message**
and nothing else.
Are you sure that's in a text/plain MIME section? I've seen that many times myself, but it's always been in the prologue (*between* MIME sections -- so it's something a non-MIME aware reader will show you).
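A quick way to check where such a notice lives, sketched with the standard `email` package. It assumes that the text before the first MIME boundary is what the package exposes as `msg.preamble`; the function name `locate_notice` is hypothetical.

```python
from email import message_from_string


def locate_notice(msg, needle):
    """Return 'preamble', 'text/plain', or None, depending on where needle is."""
    # Text before the first boundary is only shown by non-MIME-aware readers.
    if needle in (msg.preamble or ""):
        return "preamble"
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True) or b""
        if needle in payload.decode("latin-1"):
            return "text/plain"
    return None
```

Running this over a saved spam file (parsed with `message_from_string` or `message_from_file`) shows whether the notice is a real text/plain section or just text sitting outside the MIME parts.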

[Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ]
[Brad Clements]
... My feeling is that the presentation of "the message" is independent of the message itself, so if I get a message in Text, HTML, RTF only the actual content is important, not the markup method.
Everything's A Clue. Everything that gets ignored partly blinds the classifier, so the question isn't whether there's a difference, it's how much of a difference it makes.
Though I suppose using lots of red and large fonts might be an indicator of spam, the text of the message should still suffice.
Indeed, Graham reported that the hex color code for bright red was one of the strongest spam indicators in his database.
Tim's comments in timtest.py hint that stripping tags isn't a catastrophe for f-n's, but he's not planning on doing that for use on technical lists.
When HTML-only email is a 99.99% spam indicator on a tech list, it would be crazy to ignore that clue. But note that the comments *also* say I'd be delighted to remove HTML tags even there if some other way of slashing the f-n rate is proven to work (and most people who have tried it say that mining more header lines does do it -- but then I haven't seen anything from them about how they do when they ignore the header lines. I was happy to ignore header lines in order to get *some* kind of handle on how well one could do on "pure content", and it turned out that works remarkably well).
# So if a message is multipart/alternative with both text/plain
# and text/html branches, we ignore the latter, else newbies would never
# get a message through. If a message is just HTML, it has virtually no
# chance of getting through
Tells me (spammer hat on) that I can send a message with a non-spammish text-only part and a spam HTML part, since most "non-techie" email client users automatically display the HTML version when available; however, Tim's implementation will ignore it.
Sure. It *certainly* isn't a problem on my test data (as witnessed by the measured error rates). If the nature of the world changes, the code has to adapt along with it. But 90% of the spam I receive (and I get a lot) is still trivial to recognize from a mere glance at the subject line, and I don't buy that spammers are a class of ubergeek with formidable skill. Response rates are a percentage game, and more so than anti-spammers I expect spammers are keen to go for high-percentage wins at the expense of esoterica.
Most "average users" never even see the text-only part of multipart messages. In Tim's application, that's okay since he's going to use the text-only part anyway. But for my purposes, I need to consider both portions. So it's simpler for me to strip html and combine that text with the text-only part and then "test" the combined parts.
Not unreasonable <wink>, but testing remains the only way to decide. It's rare you can out-think a fraction of a percent!
participants (6)
- Anthony Baxter
- Brad Clements
- David LeBlanc
- Guido van Rossum
- Skip Montanaro
- Tim Peters