[Spambayes] Bugs? Yup. Fixes? Definitely. Downloadable installer and patches? Yes.
Thomas Hruska
thruska at cubiclesoft.com
Tue Dec 25 20:14:28 CET 2007
[For those who are interested in an installable build of 1.0.4 with my
fixes and/or want to see the source code modifications I've made, scroll down]
While applying the modifications I suggested a couple weeks ago, I'm
fairly (99%) certain I've run into at least one critical bug in the
Spambayes 1.0.4 POP3 proxy. If you use the proxy, read on.
1) Messages are not saved in their original format. There is actually
a hack that _should NOT exist_ in Corpus.py that is referenced from
ProxyUI.py:
----- ProxyUI.py -----
# fromCache is a fix for sf #851785.
# See the comments in Corpus.py
targetCorpus.takeMessage(id, sourceCorpus,
fromCache=True)
----- ProxyUI.py -----
----- Corpus.py -----
# If the notate_to or notate_subject options are set, then the
# message in the cache has this information, and it will get used
# in training, which is not ideal. So if that option is set, strip
# that data before training. The only time I can see this failing
# is if the option is changed at some point, so older messages
# don't have the notation, but some other program did do the same
# notation, which would be lost. This shouldn't be a big deal,
# though.
if fromCache:
...Modify message to attempt to make it look like the original...
----- Corpus.py -----
This hack is WRONG. Training on messages that include the modified
headers is also WRONG (as far as I can tell, the code doesn't undo that
either - but having to undo anything in the first place is WRONG).
Training should only be done using the original, unmodified message. It
is difficult to impossible to revert a modified message back to its
original state.
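To illustrate why undoing a notation is unreliable, here is a minimal sketch. It assumes the subject notation is a "classification," prefix; the two helper functions are hypothetical stand-ins written for this example and are not the code in Corpus.py.

```python
# Hypothetical helpers: the real notation format may differ.
CLASSIFICATIONS = ("spam", "ham", "unsure")


def notate_subject(subject, classification):
    """Prefix the subject with the classification, e.g. 'spam,Hello'."""
    return "%s,%s" % (classification, subject)


def strip_notation(subject):
    """Try to undo notate_subject() by removing a known prefix."""
    for name in CLASSIFICATIONS:
        prefix = name + ","
        if subject.startswith(prefix):
            return subject[len(prefix):]
    return subject


# The round trip works when the proxy really did add the prefix.
original = "Meeting at noon"
restored = strip_notation(notate_subject(original, "ham"))

# It fails when a message that was never notated happens to start
# with the same text: stripping now destroys original data.
never_notated = "spam,eggs and ham"
damaged = strip_notation(never_notated)
```

In the second case the stripped subject no longer matches what the sender wrote, so the trainer sees different tokens than the classifier saw. Keeping the untouched original avoids the guesswork entirely.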
2) sb_server.py classifies a message and modifies the headers but does
not save a copy of the original message. Instead, it saves the modified
message. It should save a copy of the original message for training
purposes.
3) message.py does not offer any method of retrieving/storing a copy of
the original message. This seems like the most logical place but would
probably be hard to implement here.
My improvement/fix: Due to the nature of how the POP3 proxy operates,
it would be seriously detrimental to either performance or usability if
the modified message were eliminated altogether. Also, changing how
messages are stored would probably create a disaster of modifications to
the entire code base. Therefore, the simplest (and only _somewhat_
hacky) solution is to create an 'unknown-orig' ExpiryFileCorpus cache
that tracks the original message for training purposes.
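The idea can be sketched roughly as follows. This is not the code from my sb_server.py changes: it is written for a current Python 3, uses an in-memory dictionary in place of an ExpiryFileCorpus, and the class name, function name, and header name are assumptions made for the example.

```python
import email


class OriginalMessageCache:
    """Hypothetical stand-in for an 'unknown-orig' cache (in memory only)."""

    def __init__(self):
        self._store = {}

    def add(self, key, raw_bytes):
        # Store an immutable copy of the bytes exactly as received.
        self._store[key] = bytes(raw_bytes)

    def get(self, key):
        return self._store[key]


def classify_and_annotate(key, raw_bytes, cache, classification):
    """Save the original first, then return a header-annotated copy."""
    cache.add(key, raw_bytes)
    msg = email.message_from_bytes(raw_bytes)
    # Header name is an assumption for illustration.
    msg["X-Spambayes-Classification"] = classification
    return msg.as_bytes()


cache = OriginalMessageCache()
raw = b"From: a@example.com\nSubject: hi\n\nbody\n"
annotated = classify_and_annotate("msg-1", raw, cache, "spam")
```

The annotated copy goes to the mail client, while training later reads `cache.get("msg-1")`, which is byte-for-byte what arrived from the server.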
For those who are interested in my solution and use the POP3 proxy,
here's the Windows installer version:
http://www.cubiclesoft.com/Unrelated/spambayes-1.0.4.exe
That applies all the recommended changes I've made to date to the 1.0.4
branch. You can train on any message and Spambayes defends itself from
letting its database get too large by rejecting messages that are
already classified correctly. It does this by running the original
message through the classifier before training on the message - thus
training only on "mistakes and unsures" as per the recommendations of
the developers. Since each message trained alters the database, I've
factored this fact in as well*. You could train on 10 messages or
60,000 messages and Spambayes will still correctly pick and choose what
to train on. In layman's terms: Spambayes is smarter than before.
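A rough sketch of the "train only on mistakes and unsures" loop follows. The toy classifier is a hypothetical stand-in, not the Spambayes classifier, and the cutoff values are assumptions for the example; only the control flow is the point.

```python
HAM_CUTOFF = 0.20   # assumed: at or below this, a message counts as ham
SPAM_CUTOFF = 0.90  # assumed: at or above this, a message counts as spam


class ToyClassifier:
    """Hypothetical classifier: scores by the share of spam token counts."""

    def __init__(self):
        self.spam_tokens = {}
        self.ham_tokens = {}

    def score(self, text):
        spam = ham = 0
        for tok in text.lower().split():
            spam += self.spam_tokens.get(tok, 0)
            ham += self.ham_tokens.get(tok, 0)
        if spam + ham == 0:
            return 0.5  # nothing known about this message: unsure
        return spam / (spam + ham)

    def train(self, text, is_spam):
        table = self.spam_tokens if is_spam else self.ham_tokens
        for tok in text.lower().split():
            table[tok] = table.get(tok, 0) + 1


def train_on_mistakes(classifier, messages):
    """Train only where the current score is wrong or unsure.

    Each message is scored against the database as it stands at that
    moment, so earlier training in the same batch is taken into account.
    """
    trained = 0
    for text, is_spam in messages:
        score = classifier.score(text)
        if is_spam:
            already_correct = score >= SPAM_CUTOFF
        else:
            already_correct = score <= HAM_CUTOFF
        if already_correct:
            continue  # reject: training again would only bloat the database
        classifier.train(text, is_spam)
        trained += 1
    return trained


batch = [
    ("buy pills now", True),
    ("buy pills now", True),
    ("lunch at noon", False),
    ("lunch at noon", False),
]
trained_count = train_on_mistakes(ToyClassifier(), batch)
```

With this batch only the first copy of each message is trained on; the duplicates are already classified correctly by the time they are reached, so they are skipped.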
* The changes I've made allow you to train on more than one message at a
time. If you've been following along with my recent rants, you know
that the existing 1.0.4 has some major statistical issues with training
beyond one message at a time. If you train on a message that would
already be classified that way, the database flattens out over time and
Spambayes will eventually not be able to figure out what is ham and what
is spam. Additionally, really large databases (e.g. 30,000+ messages)
have performance issues.
Consider wiping your Spambayes database and starting over after
installing this - especially if you have trained on more than 2000
messages. The entire database so far has been based on training on
modified messages (not the original messages!).
There is a new configuration option in the Advanced configuration page
of the POP3 proxy for controlling the name of the directory used for the
original message cache.
For those who want to see my changes (12MB file):
http://www.cubiclesoft.com/Unrelated/spambayes-1.0.4.zip
That contains all the source code modifications I made, binaries, etc.
The two most critical modifications are in sb_server.py and ProxyUI.py.
I heavily commented my changes in ProxyUI.py. I also updated the
InnoSetup script to build properly under InnoSetup 5.2.2.
This took: 4 hours to learn the basics of Python and how Spambayes
works. 30 minutes to apply my fixes. 12 hours to get the darn thing to
build properly under Python 2.5.1. Sort of. I'm fairly certain I hosed
the Outlook add-in part of the build really good due to not having
Outlook 2000. A few hours of testing (spanned over several days).
Total time spent on getting this all working was roughly 20 hours. Time
I'd rather have spent doing something else. However, I'm quite happy
with the result.
And all this is from someone who doesn't know Python**. Probably not
the most inspiring/reassuring statement, but I'll leave it up to the
developers to decide if my modifications are sound enough to merge into
the 1.0.4 branch (and hopefully good enough to consider releasing a
1.0.5). They might not go for the 2.5-specific changes because of the
problems py2exe has with the email package (case-sensitivity issues in
py2exe), but it would be nice if they did.
** I can edit code in almost any programming language without actually
knowing the language.
Merry Christmas! I can't think of a better Christmas present than to
know there will be more spam blocked next year thanks to my changes to
Spambayes.
--
Thomas Hruska
CubicleSoft President
Ph: 517-803-4197
*NEW* MyTaskFocus 1.1
Get on task. Stay on task.
http://www.CubicleSoft.com/MyTaskFocus/