Re: RE: [Python-Dev] The first trustworthy <wink> GBayes results
Don't count words multiple times, and you'll probably get fewer false positives. That's the main reason I don't do it -- because it magnifies the effect of some random word like water happening to have a big spam probability. (Incidentally, why so high? In my db it's only 0.3930784.)

--pg

--Tim Peters wrote:
[Paul Graham]
Yes, that makes sense, but I'm trained not to think <wink>. Experiment will decide it (although I *expect* it's a good change, and counting multiple occurrences was obviously a factor in several of the rare false positives). If spam really is different, it should be different in several distinct ways.
> (Incidentally, why so high? In my db it's only 0.3930784.) --pg
I expect it's because this tokenizer *only* split on whitespace. Punctuation was left intact. So, e.g., on the Python discussion list stuff like

    The new approach blows it out of the water:

and

    This is very deep water;

and

    Then you'll take to Python like a duck takes to water!

are counted as "water:" and "water;" and "water!", not as "water". The spam corpus is chock full o' "water", though:

+ Porn sites advertising water sports.
+ Assorted bottled water pitches.
+ Assorted "oxygenated water" pitches.
+ Claims of environmental friendliness explicated via stuff like "no harmful chlorine to pollute the water or air!".
+ Pitches for weight-loss gimmicks emphasizing that you'll really lose fat, not just reduce water retention.
+ Pitches for weight-loss gimmicks emphasizing that you'll reduce water retention as well as lose fat.
+ One repeated bizarre analogy for how a breast enlargement cream works in the way "a sponge absorbs water".
+ This revolutionary new flat garden hose will really cut your water bills.
+ Ditto this miracle new laundry tablet lets you use a fraction of the water needed by old-fashioned detergents.
+ Survivalist pitches often mention water in the same sentence as air and medical care.

I got tired then <wink>.
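The tokenization effect described above can be reproduced with a minimal sketch. This is not the GBayes tokenizer itself; both function names are hypothetical, and the punctuation-stripping variant is an assumption added only for contrast.

```python
import string


def split_on_whitespace(text):
    """Tokenize on whitespace only, so punctuation stays glued to words."""
    return text.split()


def split_and_strip_punctuation(text):
    """Hypothetical variant: also strip leading/trailing ASCII punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]


# The three ham examples quoted in the message above.
ham_lines = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]

naive = [tok for line in ham_lines for tok in split_on_whitespace(line)]
stripped = [tok for line in ham_lines for tok in split_and_strip_punctuation(line)]

print(naive.count("water"))     # 0: only "water:", "water;" and "water!" appear
print(stripped.count("water"))  # 3
```

With whitespace-only splitting, none of the three ham lines contributes to the ham count for "water", which would leave the spam occurrences to dominate that word's probability.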
FYI, about counting multiple instances of a word multiple times, or only once, when scoring. Changing it to count words only once did fix the specific false positive examples I mentioned. However, across 20 test runs (training on one of five pairs of corpora, and then for each such training pair running predictions across the remaining four pairs), it was a mixed bag. On some runs it appeared to be a real improvement, on others a real regression. Overall, the results didn't support concluding it made a significant difference to the false positive rate, but weakly supported concluding that it increased the false negative rate.

That's very tentative -- I didn't stare at the actual misclassifications, I just ran it while sleeping off a flu, then woke up and crunched the numbers.

This ignorant-of-MIME tokenization scheme is ridiculously bad for the false negative rate anyway (an entire line of base64 or obfuscated quoted-printable looks like a ham-favoring single "unknown word" to it), so there are bigger fish to fry first.
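A rough sketch of why counting a word once or several times changes a score. It assumes the probability-combining rule from Paul Graham's "A Plan for Spam" and leaves out the selection of the most extreme words. The probabilities, the `unknown` default, and all names are hypothetical, and this is not the code that produced the results above. It needs Python 3.8+ for `math.prod`.

```python
from math import prod


def combine(probs):
    """Combine per-word spam probabilities: P / (P + Q)."""
    p = prod(probs)
    q = prod(1.0 - x for x in probs)
    return p / (p + q)


def score(tokens, spamprob, count_once, unknown=0.5):
    """Score a message, counting each distinct word once or every occurrence."""
    words = set(tokens) if count_once else tokens
    return combine([spamprob.get(w, unknown) for w in words])


# Made-up probabilities, chosen only to make the effect visible.
spamprob = {"water": 0.9, "python": 0.1}
tokens = ["python", "water", "water", "water"]

once = score(tokens, spamprob, count_once=True)    # about 0.5
multi = score(tokens, spamprob, count_once=False)  # about 0.99
print(once, multi)
```

In this toy message the repeated "water" swamps the single ham clue when every occurrence counts, which is the magnifying effect discussed earlier in the thread.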
For what it's worth, the attached (simple) script will 'de-spamassassin' an email message. I use it on my 'spam' folder to get test messages of various ugly MIME things that spam and viruses let through...

It's not pretty, but it does the job (for me, anyway)

--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.
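The attachment is not reproduced in this archive, and what follows is not Anthony's script. It is only a minimal sketch of one part of such a cleanup, deleting headers by name prefix with the standard `email` package, assuming the unwanted headers start with `X-Spam-`. The function name, the `prefixes` parameter, and the sample message are hypothetical.

```python
import email


def strip_headers(raw_message, prefixes=("x-spam-",)):
    """Return the message text with every header matching a prefix removed."""
    prefixes = tuple(p.lower() for p in prefixes)
    msg = email.message_from_string(raw_message)
    # Copy the names first: deleting while iterating over msg.keys() is unsafe.
    for name in set(msg.keys()):
        if name.lower().startswith(prefixes):
            del msg[name]  # removes all occurrences of that header
    return msg.as_string()


# Made-up sample message.
raw = (
    "From: someone@example.com\n"
    "X-Spam-Status: Yes, hits=9.9\n"
    "X-Spam-Level: *********\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
cleaned = strip_headers(raw)
print(cleaned)
```

Passing a longer `prefixes` tuple extends it to other families of headers. Undoing any changes a filter made to the body or the Subject line is outside this sketch.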
(trimming the cc list a bit, since this is drifting a bit away from strictly discussing the current algorithm.)

    Anthony> For what it's worth, the attached (simple) script will
    Anthony> 'de-spamassassin' an email message. I use it on my 'spam'
    Anthony> folder to get test messages of various ugly MIME things that
    Anthony> spam and viruses let through...

Thanks, that helps me as well, as I need to delete the X-VM-* headers Emacs's VM mail package inserts. While spamassassin -d does what you are doing, your script can be easily extended to elide other headers as well.

One thing worth noting before everybody starts using it to massage their mailboxes is that the email package contains a bug which causes it to occasionally delete whitespace when reformatting headers. For example, in one case the header went from

    Received: from rogers.com ([24.43.65.252]) by fep02-mail.bloor.is.net.cable.rogers.com (InterMail vM.5.01.05.06 201-253-122-126-106-20020509) with ESMTP id <20020820205424.DFHH4777.fep02-mail.bloor.is.net.cable.rogers.com@rogers.com>; Tue, 20 Aug 2002 16:54:24 -0400

to

    Received: from rogers.com ([24.43.65.252]) by fep02-mail.bloor.is.net.cable.rogers.com (InterMail vM.5.01.05.06 201-253-122-126-106-20020509) with ESMTPid <20020820205424.DFHH4777.fep02-mail.bloor.is.net.cable.rogers.com@rogers.com>; Tue, 20 Aug 2002 16:54:24 -0400

Note that in the second version there is no space between "ESMTP" and "id", which had previously been separated by a newline and several spaces.

I filed a bug report about it a few days ago:

    http://python.org/sf/594893

Skip
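To make the whitespace problem concrete, here is a small sketch of header unfolding. RFC 2822 unfolds a header by removing each line break that is immediately followed by whitespace, while keeping that whitespace. The second function only reproduces the reported symptom and is not the email package's actual code. Both function names and the shortened header value are hypothetical.

```python
import re


def unfold(value):
    """Unfold a header value: drop each line break, keep the whitespace after it."""
    return re.sub(r"\r?\n(?=[ \t])", "", value)


def unfold_dropping_indent(value):
    """Reproduce the symptom: the line break *and* its indentation disappear."""
    return re.sub(r"\r?\n[ \t]+", "", value)


# Shortened, made-up stand-in for the Received: header quoted above.
folded = "from example.com with ESMTP\n        id <hypothetical-id@example.com>"

print(unfold(folded))                  # "ESMTP" and "id" stay separated
print(unfold_dropping_indent(folded))  # "ESMTPid", as in the bug report
```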
There's one other known problem - seriously misformatted MIME (as seen in spam, and email from Microsoft Entourage) causes the email package to barf out. I plan, at some point, to try and make a "if it fails, just leave the body as one chunk of text" mode, but it's a long long way down my list of priorities.

--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.
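One way such a fallback mode could look, as a sketch only. It assumes the primary parser signals bad MIME by raising `email.errors.MessageError`; current versions of the email package generally record problems in `msg.defects` instead of raising, so the failing parser is passed in as a parameter here. `parse_leniently` and `always_fails` are hypothetical names, and this is not the code that was checked into the email package.

```python
import email
from email.errors import MessageError
from email.parser import HeaderParser


def parse_leniently(text, parser=email.message_from_string):
    """Parse normally; if that fails, keep the headers and leave the body as one string."""
    try:
        return parser(text)
    except MessageError:
        # HeaderParser parses only the headers and stores everything after
        # them as a single, unparsed string payload.
        return HeaderParser().parsestr(text)


def always_fails(text):
    """Stand-in for a strict parser choking on broken MIME."""
    raise MessageError("malformed MIME")


raw = (
    "Subject: test\n"
    'Content-Type: multipart/mixed; boundary="xyz"\n'
    "\n"
    "no boundary markers anywhere in this body\n"
)
msg = parse_leniently(raw, parser=always_fails)
print(msg["Subject"], msg.is_multipart())
```

The headers remain available for inspection, and the body can still be handed to a tokenizer as plain text.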
"AB" == Anthony Baxter <anthony@interlink.com.au> writes:
    >> Skip Montanaro wrote
    >> One thing worth noting before everybody starts using it to
    >> massage their mailboxes is that the email package contains a
    >> bug which causes it to occasionally delete whitespace when
    >> reformatting headers.

BTW, I fixed Greg's problem but not Skip's. I'm still looking at this one...

    AB> There's one other known problem - seriously misformatted MIME
    AB> (as seen in spam, and email from Microsoft Entourage) causes
    AB> the email package to barf out. I plan, at some point, to try
    AB> and make a "if it fails, just leave the body as one chunk of
    AB> text" mode, but it's a long long way down my list of
    AB> priorities.

I just checked this into cvs.

-Barry
participants (5)
- Anthony Baxter
- barry@python.org
- Paul Graham
- Skip Montanaro
- Tim Peters