Hi,

I don't know whether this idea is new, and I do not need it myself. But my thought is that Bayesian filtering may be used for purposes other than spam/ham checking. As I understand it, spambayes has one database of spam and one of ham, compares each new message with the messages in each database, and calculates spam/ham probabilities this way. Would it be possible to extend this to 3, 4, or 10 different message databases?

Suppose someone gets a lot of e-mail (for example, because he is a famous politician or something like that). He may want to filter it into several mailboxes, for example based on what the e-mail is about (foreign policy, education, taxes, etc.). This is too complex for simple regexp-rule filtering. Could Bayesian filtering be a solution? If he showed it 5 mails about each subject, could Bayesian filtering do a good job of sorting all following e-mails according to those databases?

It's just a thought...

yours,
Gerrit.

--
217. If he be the slave of some one, his owner shall give the physician two shekels.
-- 1780 BC, Hammurabi, Code of Law
--
Asperger syndrome - a personal approach: http://people.nl.linux.org/~gerrit/
Rise up against this cabinet: http://www.sp.nl/
From: Gerrit Holl <gerrit@nl.linux.org>

> Would it be possible to extend this to 3, 4, or 10 different message databases?
> [...]
> If he would show 5 mails about each subject, could Bayesian filtering do a good job in filtering all following emails according to those databases?

Done and tested in CRM114. You supply a set of statistics files and say which ones you want to consider in each class. CRM114 then tells you not just which class was the winner, but which file _independently_ was the winner. So you can do an N-way split. This works embarrassingly well.

I'm not using it personally, but it was a feature request, and the user made happy, gushy noises after he tested it.

So the general technique is quite useful.

-Bill Yerazunis
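To make the N-way idea concrete, here is a minimal sketch of a multi-class classifier with one token-count table per category. It uses plain multinomial naive Bayes with add-one smoothing, which is an assumption made for illustration: it is not CRM114's scoring, nor the way spambayes combines token probabilities. The names `NWayBayes` and `tokenize` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real mail tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NWayBayes:
    """Illustrative N-way naive Bayes; not the CRM114 or spambayes algorithm."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.message_counts = Counter()           # label -> messages trained

    def train(self, label, text):
        self.token_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def scores(self, text):
        """Return {label: posterior probability} for one message."""
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())

        log_scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Class prior from the share of training messages.
            logp = math.log(self.message_counts[label] / total_messages)
            for token in tokenize(text):
                if token not in vocab:
                    continue  # tokens never seen in training carry no evidence
                # Add-one smoothing so an unseen token never zeroes a class.
                logp += math.log((counts[token] + 1) / (total_tokens + len(vocab)))
            log_scores[label] = logp

        # Normalise in a numerically stable way so the scores sum to 1.
        top = max(log_scores.values())
        weights = {label: math.exp(v - top) for label, v in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def classify(self, text):
        """Forced choice: return the label with the highest posterior."""
        result = self.scores(text)
        return max(result, key=result.get)


clf = NWayBayes()
clf.train("foreign policy", "treaty negotiations with the embassy and the ambassador")
clf.train("education", "school funding for teachers and students")
clf.train("taxes", "income tax rates and tax deductions")
print(clf.classify("new tax deductions proposed"))
```

With only a handful of training messages per category, as in the question, a sketch like this will be noisy; how well it works depends on how distinct the vocabularies of the categories are.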
On Wednesday 29 October 2003 16:51, Bill Yerazunis wrote:
> So you can do an N-way split. This works embarrassingly well.
I hacked the nway.py script a bit, and use spambayes to predict which e-mails I'm going to reply to. This works well enough, in the sense that spambayes is mostly able to separate the interesting non-spam messages from the less interesting ones. (I have two spambayes filters, one for spam and one for this little experiment.)

-- Janne
Janne> I hacked the nway.py script a bit... Anything of general usefulness? If so, could you post a patch on SF? Skip
Skip Montanaro <skip@pobox.com> writes:
>     Janne> I hacked the nway.py script a bit...
> Anything of general usefulness? If so, could you post a patch on SF?
Basically I just reduced the nway.py script back to two-way again, with forced choice (split at p=0.5) and a custom score header. It is unpolished and not generally useful, for example because there is no corresponding custom "trained" header for this filter. (I retrain by copying messages into separate directories with shell scripts, and then train from scratch.) But the idea of training on the answered status seems to be at least somewhat useful.

It would be nice if several filters, n-way or two-way, could be configured in an integrated fashion. But when I look at the traffic on the spambayes list, with all the Windows-specific problems etc., I perfectly understand if this kind of feature is not very high on the priority list. :) And currently I'm too busy and lazy to do it myself.

-- Janne
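For readers who want to picture the forced-choice step, the following is a rough sketch of tagging a message with a score header and a two-way decision at p=0.5, with no "unsure" band. It is not the hacked nway.py; the header names `X-Reply-Score` and `X-Reply-Prediction` and the function `tag_message` are hypothetical, and the probability is assumed to come from whatever classifier is in use.

```python
from email import message_from_string

SCORE_HEADER = "X-Reply-Score"       # hypothetical header name
CLASS_HEADER = "X-Reply-Prediction"  # hypothetical header name


def tag_message(raw_message, probability, threshold=0.5):
    """Add score and forced-choice prediction headers to a raw message."""
    msg = message_from_string(raw_message)
    # Drop stale headers so re-filtering does not pile up duplicates;
    # deleting a header that is absent is a no-op.
    del msg[SCORE_HEADER]
    del msg[CLASS_HEADER]
    msg[SCORE_HEADER] = f"{probability:.4f}"
    # Forced choice: every message lands in one of exactly two classes.
    msg[CLASS_HEADER] = "reply" if probability >= threshold else "no-reply"
    return msg.as_string()


raw = "From: someone@example.org\nSubject: question\n\nCan you help?\n"
print(tag_message(raw, 0.73))
```

A mail client or delivery agent could then sort on the prediction header, while the score header keeps the underlying probability available for inspection.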
participants (5)
- Bill Yerazunis
- Gerrit Holl
- Janne Sinkkonen
- Janne Sinkkonen
- Skip Montanaro