[Spambayes] Training via web interface in 1.1a6 doesn't work?

Carl Colijn c.colijn at twologs.com
Thu Oct 14 11:34:11 CEST 2010


Hi all,

I've used the ThunderBayes ThunderBird plugin with SpamBayes 1.0.4 in 
the past, and made a copy of every mail I trained the filter with in a 
Training-Spam and Training-Ham folder.  This allowed me to get the 
training back to where I left off after a re-install or database 
corruption.  I just had to go to the SpamBayes web interface, select a 
Thunderbird training folder file and press the "train as xxx" button.

The original ThunderBayes extension doesn't work anymore with 
ThunderBird 3.x, so I decided to use a plain SpamBayes installation 
after installing ThunderBird 3.x (can't live without SpamBayes anymore 
;) ).  I set up SpamBayes 1.1a6 separately and connected ThunderBird to 
it.  When I then started to let SpamBayes train on my training folders 
it didn't work.  No errors in the web interface, training seemed to go 
OK (uploaded ok, Training... Saving... Done!) but the statistics on the 
main page ("Total emails trained") didn't reflect the newly trained 
mails (neither ham nor spam).

Searching a bit more I found the original ThunderBayes plugin (now 
abandoned) had been continued as a ThunderBayes++ plugin - and it even 
includes the latest SpamBayes version as well instead of the ancient 
1.0.4 :)  So I've now set everything up again using the ThunderBayes++ 
plugin (after first uninstalling the separate SpamBayes version), and 
started training again hoping this would have fixed it.  But it didn't: 
I'm still stuck with the same situation - training seems to go OK but 
the trained-on mails don't arrive in the database.

Does anyone have an idea what could be going wrong?  I assume it's some 
silly configuration issue, but I've already tweaked it for quite a few 
hours now and can't get it right.  I've attached a clean config file set 
after training on 1 ham message for anyone willing to give it a go as well.

Some observations:
- I run Windows XP SP3 en-us with the SpamBayes 1.1a6 version shipped 
with ThunderBayes++; the databases use the pickle format
- My training folders contain roughly 250 ham and 6000 spam messages
- When I start clean (close ThunderBird/SpamBayes, delete the cache & 
training databases) it re-creates them OK when restarted again
- After a restart it claims there are 0 trained messages (of course)
- When I upload the Thunderbird ham training folder file it seems to 
process it correctly but after it's done the counter still remains at "0 
trained messages"
- hammie.db doesn't grow either (56 bytes after a clean database 
recreation, still 56 bytes after training)
- There's no error in the log
- I've enabled caching messages (ThunderBayes by default has it off I 
think), and the uploaded messages do get extracted as separate messages 
in the cache - messageinfo.db indeed also grows
- "Review messages" sometimes shows the uploaded messages, but not 
consistently - they did appear a few times after I tweaked and restarted 
and such
- Copy/pasting a separate mail with headers and training on that has the 
same effect
- When I let it train on my Spam folder (with 6000+ mails in it) it is 
seriously busy - CPU at 100% for more than 10 minutes - so it must think 
it's doing something
- Letting it train on the small Ham folder (250 messages) afterwards 
now takes far more time than before - the 6000+ spam messages it 
processed earlier must have influenced something
- When I look at the "More statistics" page, the uploaded messages 
_do_ get reflected in the "Unsures trained as good" and "Unsures trained 
as spam" statistics
- Training via the ThunderBayes buttons in ThunderBird _does_ raise the 
"trained on" counters - what does it do that I cannot do?
- There are no SMTP proxy details specified in the settings - I 
assume ThunderBayes++ passes the ham/spam training via the web interface 
as well?
- Starting from scratch again (delete db's, clear email cache) and 
selecting "bsddb" as the db type didn't change a thing
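
For anyone who wants to repeat the file-level checks from the list above 
(how many messages the uploaded folder file holds, and whether the 
database files change size during training), they could be scripted 
roughly like this.  It is only a sketch: it assumes the Thunderbird 
folder file is a plain mbox file and that a standalone Python 3 
interpreter is at hand.  The helper names are made up for illustration 
and are not part of SpamBayes.

```python
import mailbox
import os


def count_mbox_messages(path):
    """Return the number of messages in an mbox-format folder file."""
    # create=False: fail instead of silently creating an empty mailbox
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def snapshot_sizes(paths):
    """Map each path to its size in bytes, or None if it does not exist."""
    return {p: (os.path.getsize(p) if os.path.exists(p) else None)
            for p in paths}


def changed_files(before, after):
    """Return the paths whose size differs between two snapshots."""
    return sorted(p for p in after if before.get(p) != after[p])
```

Taking one snapshot of databases/hammie.db and databases/messageinfo.db 
before training and one after, then passing both to changed_files(), 
shows which of the files were actually written to.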

Here's the spambayes.ini file I use:

[Headers]
include_score:True
notate_subject:
[Storage]
persistent_use_database:pickle
persistent_storage_file:databases/hammie.db
cache_expiry_days:2
cache_messages:True
no_cache_bulk_ham:False
messageinfo_storage_file:databases/messageinfo.db
ham_cache:cache/ham
spam_cache:cache/spam
unknown_cache:cache/unsure
[html_ui]
default_spam_action:defer
display_score:True
[pop3proxy]
use_ssl:automatic
listen_ports:53100,53101,53102
remote_servers:xxx.xxx.com:110,xxx.xxx.com:995,xxx.xxx.nl:110
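
To double-check which files the [Storage] section above points at, the 
ini file can be read with Python's standard configparser module, which 
accepts the "name:value" syntax used here.  This is a sketch under 
assumptions: storage_paths() is a hypothetical helper, and resolving the 
relative paths against a given base directory is my guess, not 
necessarily how SpamBayes itself resolves them.

```python
import configparser
import os

# The [Storage] section exactly as posted above.
SAMPLE_INI = """\
[Storage]
persistent_use_database:pickle
persistent_storage_file:databases/hammie.db
cache_expiry_days:2
cache_messages:True
no_cache_bulk_ham:False
messageinfo_storage_file:databases/messageinfo.db
ham_cache:cache/ham
spam_cache:cache/spam
unknown_cache:cache/unsure
"""


def storage_paths(ini_text, base_dir):
    """Return the database type and the resolved database file paths.

    ASSUMPTION: relative paths are resolved against base_dir.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    storage = parser["Storage"]

    def resolve(option):
        return os.path.normpath(os.path.join(base_dir, storage[option]))

    return {
        "db_type": storage["persistent_use_database"],
        "classifier_db": resolve("persistent_storage_file"),
        "messageinfo_db": resolve("messageinfo_storage_file"),
    }
```

Comparing the resolved paths with the files that actually change on disk 
would at least rule out that the web interface and the plugin are 
writing to different databases.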

-- 
Kind regards,
Carl Colijn

TwoLogs - IT Services and Product Development
A natural choice!
http://www.twologs.com
TimeTraces: the powerful and versatile time registration system!
http://timetraces.twologs.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: files.zip
Type: application/octet-stream
Size: 11175 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/spambayes/attachments/20101014/bc1a0e2a/attachment.obj>

