CRM114 in November breaks 99.9%. :-)
Final test statistics for CRM114 for November are in:
Standard rules apply (no whitelists, no blacklists, realtime email stream only (no "canned spam"), train only on errors, polynomial length 5)
For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)
Spams   Nonspams   False     False     Total    N+1 Accuracy   NHC's
                   Accepts   Rejects   Emails
1993    3914       4         0         5911     99.915         2
Spam features in hash tables: 398K
Nonspam features in hash tables: 299K
There was just 1 spam that got through in the last week of November- a very strange spam written in mixed English and Czech trying to sell me diesel engine parts. It came through on a moto-head email list, which I suppose might be slightly topical, and it certainly was amusing, rather reminiscent of the Monty Python "camshaft smuggling" skit, but it's still spam and counts as such.
This gives an N+1 accuracy of > 99.9% for the entire month of November. (99.932% for N-accuracy).
So, CRM114 barely squeaked through the month at >99.9%. Barely. There's clearly still work to be done (the spambayes mailing list is kicking around the proper way to evaluate probabilities; I'm looking into some of their ideas as well.)
--- On The Other Hand (the bad news)---
December is looking much worse - TWO have gotten through already over the weekend (one "barnyard teen" pornspam, which it hasn't seen before, and one very short mortgage solicitation, written folksy-style).
I'm also getting mailer errors now out of Sendmail whenever I do a "learn"; I'm starting to think that our systems people have upgraded something and broken something else in the process. This throws some question onto whether the CRM114 training code is actually getting run at all, or whether the increasing spam rate is symptomatic of the evolution of spam against static filters.
-Bill Yerazunis
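The two accuracy figures can be reproduced from the counts in the table. The sketch below is an editorial illustration, not Bill's code. It assumes that "Total Emails" is the sum of the four count columns (which matches 5911) and that "N+1 accuracy" means charging the filter one extra error on top of the observed ones. That definition is inferred from the numbers and is not stated in the message.

```python
# Counts taken from the November table above.
spams, nonspams = 1993, 3914
false_accepts, false_rejects = 4, 0

total = spams + nonspams + false_accepts + false_rejects
errors = false_accepts + false_rejects

# N-accuracy: fraction of all emails that were classified correctly.
n_accuracy = 100.0 * (total - errors) / total

# N+1 accuracy (assumed definition): same, but with one extra error charged.
n_plus_1_accuracy = 100.0 * (total - errors - 1) / total

print(f"total emails:  {total}")
print(f"N accuracy:    {n_accuracy:.3f}%")
print(f"N+1 accuracy:  {n_plus_1_accuracy:.3f}%")
```

Under these assumptions the script prints 5911, 99.932% and 99.915%, the same figures as reported.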
Bill Yerazunis said the following on 02/12/02 14:44:
Final test statistics for CRM114 for November are in:
Standard rules apply (no whitelists, no blacklists, realtime email stream only (no "canned spam"), train only on errors, polynomial length 5)
For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)
Spams   Nonspams   False     False     Total    N+1 Accuracy   NHC's
                   Accepts   Rejects   Emails
1993    3914       4         0         5911     99.915         2
Spam features in hash tables: 398K Nonspam features in hash tables: 299K
CRM114's learn and classify stuff looks really interesting, but it has a really freaky syntax to someone who is used to regular procedural or OO languages like Perl, Python, C, etc. Is there *any* chance the library in crm114 for learning and classifying can be extracted into a plain .so? That would be tremendous, and I'd be willing to build a perl XS library for it in a heartbeat. If not, we'll just have to try and copy the sparse binary polynomial hash idea ;-)
From: Matt Sergeant <msergeant@startechgroup.co.uk>
CRM114's learn and classify stuff looks really interesting, but it has a really freaky syntax to someone who is used to regular procedural or OO languages like Perl, Python, C, etc.
It _is_ procedural, it's just extremely high level. Perhaps higher-level than APL if you count statements rather than operators.
And sorry about the syntax. I was being playful, and reading a book on Latin at the time, which is why it uses symmetric declensional parsing rather than something more sane, like recursive descent. (*)
Is there *any* chance the library in crm114 for learning and classifying can be extracted into a plain .so? That would be tremendous, and I'd be willing to build a perl XS library for it in a heartbeat.
Yes, it's not difficult to get at the code.
Pop the .gz open, emacs the file crm114.c, and look for the case headers "CRM_LEARN" and "CRM_CLASSIFY" respectively. The code there is _not_ generated, but executed in-line, so cut and paste will work.
The current code requires a null-terminated string as input, but that's because of the GNU regex library limits (when TRE gives me a new library, that requirement will go away). You _will_ need to link it against a regex library (of your choice, CRM114 uses the standard POSIX regcomp/regexec calling sequence), and the OS itself needs to support stat() [for file existence/length] and mmap() [to map a file into virtual memory without actually reading it in a byte at a time- this is just for efficiency and can be worked around].
How bad do you want it? :-)
If not, we'll just have to try and copy the sparse binary polynomial hash idea ;-)
Always legitimate. It's GPLware, no problemo.
-Bill Yerazunis
(*) all in all, I like the way it ended up; one can just type programs on the command line and they do useful things. But hindsight is always 20/20, and "less weirdass" might be better in the long run.
Bill Yerazunis said the following on 02/12/02 15:57:
From: Matt Sergeant <msergeant@startechgroup.co.uk>
CRM114's learn and classify stuff looks really interesting, but it has a really freaky syntax to someone who is used to regular procedural or OO languages like Perl, Python, C, etc.
It _is_ procedural, it's just extremely high level. Perhaps higher-level than APL if you count statements rather than operators.
Sorry, I meant "procedural like Perl/Python/C" not "procedural, like Perl/Python/C". Actually maybe python shouldn't be in that list since it has a weirdass syntax too :-)
Is there *any* chance the library in crm114 for learning and classifying can be extracted into a plain .so? That would be tremendous, and I'd be willing to build a perl XS library for it in a heartbeat.
Yes, it's not difficult to get at the code.
Pop the .gz open, emacs the file crm114.c, and look for the case headers "CRM_LEARN" and "CRM_CLASSIFY" respectively. The code there is _not_ generated, but executed in-line, so cut and paste will work.
The current code requires a null-terminated string as input, but that's because of the GNU regex library limits (when TRE gives me a new library, that requirement will go away). You _will_ need to link it against a regex library (of your choice, CRM114 uses the standard POSIX regcomp/regexec calling sequence), and the OS itself needs to support stat() [for file existence/length] and mmap() [to map a file into virtual memory without actually reading it in a byte at a time- this is just for efficiency and can be worked around].
I was thinking of punting on splitting the email to tokens back to the host language. Since perl and python both support POSIX regexps (and thus [[:graph:]]) it's probably easier that way. Unless there's an inherent reason it has to be embedded in the library.
How bad do you want it? :-)
What interests me is the hashing technique. It should be reasonably easy to extract that, but for me it's just a lack of tuits - it's hard enough keeping up with my regular day to day activities, and my todo list never gets shorter.
(*) all in all, I like the way it ended up; one can just type programs on the command line and they do useful things. But hindsight is always 20/20, and "less weirdass" might be better in the long run.
I imagine you'd get a few more users with a regular syntax ;-)
Matt.
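For readers wondering what "copying the sparse binary polynomial hash idea" might look like in a host language, here is a rough editorial sketch in Python. It is not extracted from crm114.c. The assumptions: tokens are runs of printable non-space ASCII (the usual meaning of [[:graph:]]+; note that Python's standard `re` module does not accept POSIX bracket classes, so the range `[!-~]` is used instead), the window is 5 tokens ("polynomial length 5"), each feature always keeps the newest token and keeps or skips each older token in the window, and the hash is CRC32 reduced modulo the bin count shown in the histogram later in the thread. The names `tokenize`, `sbph_features`, `feature_bin` and the `<skip>` placeholder are made up for illustration.

```python
import re
import zlib

WINDOW = 5          # "polynomial length 5" from the test rules
N_BINS = 1048577    # bin count reported for the .css files in this thread
SKIP = "<skip>"     # hypothetical placeholder for a skipped window position


def tokenize(text):
    """Split text into runs of printable, non-space ASCII characters."""
    return re.findall(r"[!-~]+", text)


def sbph_features(tokens, window=WINDOW):
    """For each token, emit every keep/skip combination of the older tokens
    in the window, always followed by the newest token itself."""
    features = []
    for i, newest in enumerate(tokens):
        older = tokens[max(0, i - window + 1):i]   # up to window-1 earlier tokens
        for mask in range(1 << len(older)):
            kept = [tok if (mask >> k) & 1 else SKIP
                    for k, tok in enumerate(older)]
            features.append(" ".join(kept + [newest]))
    return features


def feature_bin(feature, n_bins=N_BINS):
    """Map a feature string to a bin index (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(feature.encode("utf-8")) % n_bins


if __name__ == "__main__":
    toks = tokenize("cheap diesel engine parts for sale")
    feats = sbph_features(toks)
    print(len(toks), "tokens ->", len(feats), "features")
    print(feats[-1], "->", feature_bin(feats[-1]))
```

With a full window every new token contributes 2^(5-1) = 16 features, which is why the hash tables fill so much faster than a plain word-count table would.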
The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set?
At 09:44 AM 12/2/2002, Bill Yerazunis wrote:
Final test statistics for CRM114 for November are in:
Standard rules apply (no whitelists, no blacklists, realtime email stream only (no "canned spam"), train only on errors, polynomial length 5)
For All of November (starting 9 AM Nov 1, ending 9 AM Dec 1)
Spams   Nonspams   False     False     Total    N+1 Accuracy   NHC's
                   Accepts   Rejects   Emails
1993    3914       4         0         5911     99.915         2
Spam features in hash tables: 398K Nonspam features in hash tables: 299K
There was just 1 spam that got through in the last week of November- a very strange spam written in mixed English and Czech trying to sell me diesel engine parts. It came through on a moto-head email list, which I suppose might be slightly topical, and it certainly was amusing, rather reminiscent of the Monty Python "camshaft smuggling" skit, but it's still spam and counts as such.
This gives an N+1 accuracy of > 99.9% for the entire month of November. (99.932% for N-accuracy).
So, CRM114 barely squeaked through the month at >99.9%. Barely. There's clearly still work to be done (the spambayes mailing list is kicking around the proper way to evaluate probabilities; I'm looking into some of their ideas as well.)
--- On The Other Hand (the bad news)---
December is looking much worse - TWO have gotten through already over the weekend (one "barnyard teen" pornspam, which it hasn't seen before, and one very short mortgage solicitation, written folksy-style).
I'm also getting mailer errors now out of Sendmail whenever I do a "learn"; I'm starting to think that our systems people have upgraded something and broken something else in the process. This throws some question onto whether the CRM114 training code is actually getting run at all, or whether the increasing spam rate is symptomatic of the evolution of spam against static filters.
-Bill Yerazunis
At 11:04 AM -0500 12/2/02, Ken Anderson wrote:
The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set?
Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only.
R
--
===========================================================
Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/
===========================================================
http://selfpromotion.com/ The Net's only URL registration SHARESERVICE. A power tool for power webmasters.
Yes, this is my concern. I think the approach Robert describes is perfectly fine for adaptively learning how to filter email, though there should probably be some form of forgetting, although the system will eventually forget on its own as words occur less often.
However, if this is the approach Bill uses, you can't use it for performance estimates. Our speech and natural language group is very careful not to mix its training set with its test set. When they do, they do something like 10-fold cross validation, which averages (?) the results of 10 experiments that take some random fraction of the data as training and the rest as testing.
This gives a lower performance score that is likely to be more accurate on real data.
If you're getting 3 9's be sure you're getting them the hard way.
k
At 05:35 PM 12/2/2002, Robert Woodhead wrote:
At 11:04 AM -0500 12/2/02, Ken Anderson wrote:
The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set?
Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only.
R
--On Monday, December 02, 2002 9:00 PM -0500 Ken Anderson <kanderson@bbn.com> wrote:
However, if this is the approach Bill uses, you can't use it for performance estimates. Our speech and natural language group is very careful not to mix its training set with its test set. When they do, they do something like 10-fold cross validation, which averages (?) the results of 10 experiments that take some random fraction of the data as training and the rest as testing.
This gives a lower performance score that is likely to be more accurate on real data.
Absolutely. That's the way I evaluate algorithms in SpamProbe as well. I use 10 different random partitionings of my good and bad spams into training and test subsets. Some tests yield excellent results. Others yield bad results. The average is always somewhere in the middle. Taking only a single partitioning isn't a very good way to evaluate the accuracy of an algorithm.
All the best,
++Brian
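The evaluation Brian describes (several random partitionings, then look at the spread and the average) can be sketched as below. This is an editorial illustration, not SpamProbe's test harness. Strictly speaking it is repeated random sub-sampling rather than textbook 10-fold cross validation, where the ten test folds are disjoint. The function names, the 50% split and the toy word-set classifier are all assumptions made to keep the example runnable.

```python
import random
from statistics import mean


def evaluate_partitions(examples, train, classify, trials=10,
                        train_fraction=0.5, seed=0):
    """Score a train/classify pair on several random train/test splits.

    examples: list of (text, label) pairs; returns one accuracy per trial.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train_set, test_set = shuffled[:cut], shuffled[cut:]
        model = train(train_set)                     # never sees test_set
        correct = sum(classify(model, text) == label
                      for text, label in test_set)
        scores.append(correct / len(test_set))
    return scores


# Toy stand-in classifier (assumption): remembers which words it saw per class.
def train_word_sets(train_set):
    model = {"spam": set(), "ham": set()}
    for text, label in train_set:
        model[label].update(text.split())
    return model


def classify_word_sets(model, text):
    words = set(text.split())
    spam_hits = len(words & model["spam"])
    ham_hits = len(words & model["ham"])
    return "spam" if spam_hits > ham_hits else "ham"


if __name__ == "__main__":
    corpus = [("cheap diesel parts", "spam"), ("cheap mortgage now", "spam"),
              ("buy cheap parts", "spam"), ("mortgage rates cheap", "spam"),
              ("meeting notes attached", "ham"), ("lunch after meeting", "ham"),
              ("notes from lunch", "ham"), ("see attached notes", "ham")]
    scores = evaluate_partitions(corpus, train_word_sets, classify_word_sets)
    print("min", min(scores), "mean", mean(scores), "max", max(scores))
```

Reporting the minimum and maximum alongside the mean makes the point from the message visible: a single lucky or unlucky partition says little on its own.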
However, if this is the approach Bill uses, you can't use it for performance estimates. Our speech and natural language group is very careful not to mix its training set with its test set. When they do, they do something like 10-fold cross validation, which averages (?) the results of 10 experiments that take some random fraction of the data as training and the rest as testing.
Ah, but the point is, since each individual user will have his own email stream to train on, all you care about is how accurate the system is when it looks at the very next email that comes in. Thus, a system that gets very good after a few weeks of training on all the incoming mail, AND STAYS THAT WAY, is what you want in the real world.
Dividing up training sets can be good for analysing the statistical properties of particular algorithm choices, but what counts (in a production environment) is real world performance, and real world filters have to adapt as the spam (and ham) changes over time.
Tests like "pick a random sample, train on it, and then pick another sample (nonintersecting) from the same corpus, and test" don't properly reflect the real world environment. Spams are ordered by time!
Thus, my philosophical position is that a real world app has to train on every incoming email (and be corrected by the user when it goofs).
At 9:30 PM -0500 12/2/02, Bill Yerazunis wrote:
The reason I haven't auto-trained is due to my lack of understanding on what the limiting amount of self-teaching one can allow that doesn't go off into belly gaze.
This cannot happen unless the user is derelict in not correcting the output. If he does correct it, then the input to the training system is 100% correct. And if the training system has an aging system, correction mistakes will eventually decay (and, if they cause misclassifications, the user will notice and correct the filter).
Keep in mind there is always a new stream of incoming spam and ham to work with.
R
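Robert's "how accurate is it on the very next email" criterion corresponds to a test-then-train loop over the time-ordered stream: classify each message before its true label is known, score the guess, then train with the user-corrected label. The sketch below is an editorial illustration under that reading. `prequential_accuracy`, its `warmup` parameter and the memorising toy filter are hypothetical names, not anything from CRM114.

```python
def prequential_accuracy(stream, classify, learn, warmup=0):
    """Test-then-train over a time-ordered stream of (text, label) pairs.

    Each message is classified before the filter is allowed to learn from it.
    Messages before index `warmup` still train the filter but are not scored.
    Returns the fraction of scored messages classified correctly, or None if
    nothing was scored.
    """
    correct = scored = 0
    for i, (text, label) in enumerate(stream):
        guess = classify(text)          # prediction made on unseen mail
        if i >= warmup:
            scored += 1
            correct += (guess == label)
        learn(text, label)              # train with the user-verified label
    return correct / scored if scored else None


if __name__ == "__main__":
    # Toy filter (assumption): memorises whole messages, defaults to "ham".
    seen = {}
    stream = [("a", "spam"), ("b", "ham"), ("a", "spam"), ("a", "spam")]
    print(prequential_accuracy(stream, lambda t: seen.get(t, "ham"),
                               seen.__setitem__))
```

The `warmup` argument makes explicit the choice the thread is arguing about: whether the early, error-heavy training period counts toward the reported accuracy.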
Date: Mon, 2 Dec 2002 17:35:36 -0500
From: Robert Woodhead <trebor@animeigo.com>
Cc: spamfilt@archub.org, spambayes@python.org
At 11:04 AM -0500 12/2/02, Ken Anderson wrote:
The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set?
Training a particular incarnation of CRM114 usually takes a week or two; I read my mail (both categories) and when I find a piece of mail misclassified, I train that one piece into the filter. After a couple of days the errors get very sparse; after two or three weeks, I "go for data" and that's what gets reported in the monthlies.
The current spam.css files are pretty much based on the live spam errors in the first week of October; since only four spam came through in all of November and only two were worth training on (the Czech Diesel Parts spam was just too funny to train out), the .css files are pretty much unchanged.
Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only.
I did put in that capability as a flag called "refute". You can say
    learn < refute > ( spamfile.css ) /[[:graph:]]/
to unlearn something as nonspam, and then you can relearn it in the proper category, but except for testing code paths, I've never actually used it.
On the other hand, there's an old difficulty in AI that one of my teachers called "the Kalman Belly Gaze". If you let a filter (of any type, he was teaching Kalman filters at the time but it applies to any trained filter) learn on its own output stream, it quickly reinforces its own behavior to the exclusion of all else (i.e. it goes off and gazes at its own navel, simply ignoring the reality of the world around it).
The reason I haven't auto-trained is due to my lack of understanding on what the limiting amount of self-teaching one can allow that doesn't go off into belly gaze.
-Bill Yerazunis
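The two training policies discussed in this message can be put side by side in a few lines. This is an editorial sketch, not CRM114 code: `CountingFilter` is a toy word-count classifier standing in for the real one, and its `refute` method is only a loose analogue of the "refute" flag mentioned above (it subtracts what an earlier learn added). All names are hypothetical.

```python
from collections import Counter


class CountingFilter:
    """Toy word-count filter; a stand-in for the real classifier (assumption)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        words = text.split()
        spam = sum(self.counts["spam"][w] for w in words)
        ham = sum(self.counts["ham"][w] for w in words)
        return "spam" if spam > ham else "ham"      # ties go to ham

    def learn(self, text, label):
        self.counts[label].update(text.split())

    def refute(self, text, label):
        """Undo an earlier learn of `text` under `label`."""
        self.counts[label].subtract(text.split())
        self.counts[label] = +self.counts[label]    # unary plus drops counts <= 0


def train_on_errors(filt, text, true_label):
    """Bill's policy: only misclassified mail changes the tables."""
    guess = filt.classify(text)
    if guess != true_label:
        filt.learn(text, true_label)
    return guess


def train_on_everything(filt, text, true_label):
    """Robert's policy: learn every verdict, then unlearn and relearn on error."""
    guess = filt.classify(text)
    filt.learn(text, guess)
    if guess != true_label:
        filt.refute(text, guess)
        filt.learn(text, true_label)
    return guess
```

Running both policies over the same stream shows the practical difference: train-on-errors leaves the tables nearly static once the filter is accurate, while train-on-everything keeps adding counts for every message, which matters for the counter-size question raised later in the thread.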
--On Monday, December 02, 2002 9:30 PM -0500 Bill Yerazunis <wsy@merl.com> wrote:
Training a particular incarnation of CRM114 usually takes a week or two; I read my mail (both categories) and when I find a piece of mail misclassified, I train that one piece into the filter.
Training only on errors after a cut-off point is interesting. Why do you do this? Is there a reason not to increment the good/spam counts for terms in every email? Is it to avoid overflowing the counts in your hash table or is this likely to be more accurate since it keeps the message counts small?
After a couple of days the errors get very sparse; after two or three weeks, I "go for data" and that's what gets reported in the monthlies.
Perhaps I misunderstand, but doesn't that mean that you are training up to a desirable accuracy before beginning to measure your accuracy? Is the transition from training to performance measurement based on a predetermined arbitrary cut off (i.e. 1,000 emails, x% of messages in corpus, or 14 calendar days of training) or based on the accuracy rising to a certain level?
All the best,
++Brian
From: Brian Burton <brian@burton-computer.com>
Training a particular incarnation of CRM114 usually takes a week or two; I read my mail (both categories) and when I find a piece of mail misclassified, I train that one piece into the filter.
Training only on errors after a cut-off point is interesting. Why do you do this? Is there a reason not to increment the good/spam counts for terms in every email? Is it to avoid overflowing the counts in your hash table or is this likely to be more accurate since it keeps the message counts small?
The reason I started doing it is that I used "unsigned char" as the counters in the big hash tables, to keep them as small as reasonable (remember, we're doing really _random_ accesses of these files and we thrash virtual memory and cache like crazy). The bin incrementer is "smart" in that it won't wrap past 255, but it is losing data at that point, and losing it on the _most_ significant features.
I did consider "uncorking" the values up to unsigned int16, but I haven't had a good justification to do that yet. It's a simple change and if there's a need, it'll happen.
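A minimal sketch of the saturating one-byte counter described above, assuming a Python `bytearray` as a stand-in for the table of unsigned char bins. The real implementation is C inside CRM114 and is not reproduced here; `new_table` and `increment` are hypothetical names.

```python
N_BINS = 1048577    # bin count reported for the .css files in this thread
MAX_COUNT = 255     # largest value an unsigned char can hold


def new_table(n_bins=N_BINS):
    """One byte per bin, all starting at zero."""
    return bytearray(n_bins)


def increment(table, index):
    """Add one to a bin without wrapping past 255.

    Returns True if the count was recorded, False if the bin was already
    saturated and this observation was therefore lost.
    """
    if table[index] < MAX_COUNT:
        table[index] += 1
        return True
    return False


if __name__ == "__main__":
    table = new_table(8)
    lost = sum(not increment(table, 3) for _ in range(300))
    print(table[3], "stored,", lost, "observations lost")
```

The trade-off is the one stated in the message: one byte per bin keeps a million-bin table at about a megabyte, at the price of silently dropping counts on exactly the features seen most often.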
After a couple of days the errors get very sparse; after two or three weeks, I "go for data" and that's what gets reported in the monthlies.
Perhaps I misunderstand, but doesn't that mean that you are training up to a desirable accuracy before beginning to measure your accuracy? Is the transition from training to performance measurement based on a predetermined arbitrary cut off (i.e. 1,000 emails, x% of messages in corpus, or 14 calendar days of training) or based on the accuracy rising to a certain level?
It's measured intuitively, by when I find I'm just not getting enough errors to keep my attention in training. This _is_ human-guided training, mind you. Other influences on when to start are "it's the start of November, start getting data" and "now that the BCR has that nasty underflow problem fixed and the data has settled down, let's get numbers".
The other issue that can't be dodged is that spam is not ergodic; spam evolves in fits and starts; my spam of 1996 is very different than my spam of 2002. Any filter that is trained and tested against data statically is operating "in vitro"- a necessary and useful scientific measure but it misses the point of how well a spam filter can retrain on the fly against evolution in action.
The training period coincidentally works out to be about 2+ weeks of training, and co-coincidentally I usually have just a few bins in the hash table maxing out about then. (right now I've got 7 bins out of a million maxed out in the spam hashtable, and 5 bins out of a million maxed out in the nonspam hashtable.)
If I were to find that I was maxing out a significant number of bins (say, hundreds) I'd rebuild with unsigned int16 bins and accept the performance hit.
(yes, this is a very "engineering" style approach; I'm not a good mathematician, so I just do experiments and report on what comes back.)
For those of you with exceptionally high boredom thresholds, the current under-test spectra histograms follow. It does exhibit a comforting long distribution tail.
-Bill Y.
Sparse spectra file spam.css has 1048577 bins total
total number of hash datums in this file is 398830
now scanning bins- please be patient...
bin value 0 found 786135 times
bin value 1 found 188350 times
bin value 2 found 48948 times
bin value 3 found 11125 times
bin value 4 found 8550 times
bin value 5 found 2511 times
bin value 6 found 992 times
bin value 7 found 464 times
bin value 8 found 470 times
bin value 9 found 240 times
bin value 10 found 140 times
bin value 11 found 104 times
bin value 12 found 77 times
bin value 13 found 65 times
bin value 14 found 46 times
bin value 15 found 47 times
bin value 16 found 32 times
bin value 17 found 36 times
bin value 18 found 19 times
bin value 19 found 17 times
bin value 20 found 30 times
bin value 21 found 11 times
bin value 22 found 14 times
bin value 23 found 8 times
bin value 24 found 7 times
bin value 25 found 7 times
bin value 26 found 6 times
bin value 27 found 10 times
bin value 28 found 9 times
bin value 29 found 7 times
bin value 30 found 6 times
bin value 31 found 6 times
bin value 32 found 5 times
bin value 33 found 2 times
bin value 34 found 5 times
bin value 35 found 2 times
bin value 36 found 6 times
bin value 37 found 5 times
bin value 38 found 2 times
bin value 39 found 2 times
bin value 40 found 4 times
bin value 41 found 2 times
bin value 43 found 3 times
bin value 44 found 1 times
bin value 46 found 3 times
bin value 47 found 1 times
bin value 50 found 2 times
bin value 52 found 3 times
bin value 53 found 3 times
bin value 55 found 1 times
bin value 56 found 3 times
bin value 58 found 1 times
bin value 60 found 1 times
bin value 62 found 1 times
bin value 64 found 1 times
bin value 69 found 1 times
bin value 73 found 1 times
bin value 74 found 1 times
bin value 76 found 1 times
bin value 77 found 1 times
bin value 89 found 1 times
bin value 90 found 2 times
bin value 103 found 1 times
bin value 105 found 2 times
bin value 116 found 1 times
bin value 121 found 1 times
bin value 130 found 1 times
bin value 143 found 1 times
bin value 146 found 1 times
bin value 157 found 1 times
bin value 171 found 1 times
bin value 175 found 2 times
bin value 189 found 1 times
bin value 208 found 1 times
bin value 255 found 7 times
Sparse spectra file nonspam.css has 1048577 bins total
total number of hash datums in this file is 299527
now scanning bins- please be patient...
bin value 0 found 819494 times
bin value 1 found 187269 times
bin value 2 found 31009 times
bin value 3 found 7158 times
bin value 4 found 1776 times
bin value 5 found 614 times
bin value 6 found 371 times
bin value 7 found 165 times
bin value 8 found 100 times
bin value 9 found 76 times
bin value 10 found 74 times
bin value 11 found 46 times
bin value 12 found 46 times
bin value 13 found 29 times
bin value 14 found 46 times
bin value 15 found 53 times
bin value 16 found 38 times
bin value 17 found 16 times
bin value 18 found 24 times
bin value 19 found 9 times
bin value 20 found 5 times
bin value 21 found 11 times
bin value 22 found 7 times
bin value 23 found 13 times
bin value 24 found 5 times
bin value 25 found 6 times
bin value 26 found 6 times
bin value 27 found 5 times
bin value 28 found 3 times
bin value 29 found 3 times
bin value 30 found 10 times
bin value 31 found 5 times
bin value 32 found 4 times
bin value 33 found 4 times
bin value 34 found 3 times
bin value 35 found 3 times
bin value 36 found 5 times
bin value 37 found 2 times
bin value 38 found 3 times
bin value 39 found 3 times
bin value 40 found 2 times
bin value 41 found 2 times
bin value 45 found 1 times
bin value 46 found 2 times
bin value 48 found 3 times
bin value 49 found 3 times
bin value 50 found 1 times
bin value 51 found 1 times
bin value 52 found 2 times
bin value 54 found 1 times
bin value 55 found 1 times
bin value 56 found 1 times
bin value 57 found 1 times
bin value 58 found 1 times
bin value 59 found 1 times
bin value 60 found 1 times
bin value 64 found 1 times
bin value 66 found 1 times
bin value 67 found 1 times
bin value 71 found 2 times
bin value 72 found 1 times
bin value 74 found 1 times
bin value 75 found 1 times
bin value 78 found 1 times
bin value 79 found 1 times
bin value 80 found 2 times
bin value 82 found 2 times
bin value 83 found 1 times
bin value 86 found 1 times
bin value 95 found 1 times
bin value 102 found 1 times
bin value 104 found 1 times
bin value 113 found 1 times
bin value 122 found 1 times
bin value 138 found 1 times
bin value 164 found 1 times
bin value 169 found 1 times
bin value 173 found 1 times
bin value 183 found 1 times
bin value 189 found 1 times
bin value 222 found 1 times
bin value 254 found 1 times
bin value 255 found 5 times
Enter bin value to zeroize, or 0 to exit:
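A listing like the one above can be produced from any table of one-byte bins by counting how many bins hold each value. The sketch below is an editorial illustration, not the CRM114 utility that printed it. It assumes the table is available as a bytes-like object, and that "hash datums" means the sum of all bin values; that reading is inferred from the spam.css listing, where the counts weighted by bin value add up to exactly 398830. The function names are hypothetical.

```python
from collections import Counter

MAX_COUNT = 255     # saturation value of a one-byte bin


def bin_histogram(table):
    """Map each bin value to the number of bins holding it."""
    return Counter(table)       # iterating a bytes-like object yields ints


def summarize(table):
    """Totals comparable to the header lines of the listing."""
    hist = bin_histogram(table)
    return {
        "bins": len(table),
        "datums": sum(table),               # assumed meaning of "hash datums"
        "saturated": hist[MAX_COUNT],       # bins that can no longer count
    }


def print_histogram(table):
    """Print one line per bin value that occurs, in ascending order."""
    for value, times in sorted(bin_histogram(table).items()):
        print(f"bin value {value} found {times} times")


if __name__ == "__main__":
    demo = bytearray([0, 0, 1, 1, 2, 255, 255])
    print(summarize(demo))
    print_histogram(demo)
```

The "saturated" figure is the one Bill watches: 7 bins at 255 in spam.css and 5 in nonspam.css, which he treats as too few to justify moving to 16-bit bins.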
participants (5): Bill Yerazunis, Brian Burton, Ken Anderson, Matt Sergeant, Robert Woodhead