[Spambayes] training

Mark Hammond mhammond at skippinet.com.au
Wed Feb 19 21:48:36 EST 2003


> > Which, coincidentally, leads us to what I have been advocating
> > for some time <wink>.
> 
> :)
> 
> > The core spambayes code should persist 
> > the word database as now, but also a basic "message 
> > database".
> 
> Do you mean one like pop3proxy's cache?  i.e. one that 
> expires messages over a certain age?

I actually just meant a simple msg_id->trained_as_spam dictionary: just a
record that a message was previously trained as ham or spam, so that both the
need to untrain and repeated requests for the same message can be detected.
This is user-proof in the face of I-double-click-everywhere type users <wink>
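
A minimal sketch of that training memory, to make the idea concrete.  The
class and method names (TrainingMemory, plan) are hypothetical and not part
of the spambayes code; the sketch only decides what the caller would need to
do and leaves the actual classifier training out.

```python
# Hypothetical sketch: these names are illustrative, not spambayes API.
class TrainingMemory:
    def __init__(self):
        # msg_id -> True if last trained as spam, False if trained as ham
        self.trained = {}

    def plan(self, msg_id, is_spam):
        """Return the steps needed to train msg_id as spam (or ham)."""
        previous = self.trained.get(msg_id)
        if previous is None:
            actions = ["train"]             # never seen before
        elif previous == is_spam:
            actions = []                    # duplicate request, nothing to do
        else:
            actions = ["untrain", "train"]  # trained the other way earlier
        self.trained[msg_id] = is_spam
        return actions


memory = TrainingMemory()
first = memory.plan("msg-1", True)      # new message
repeat = memory.plan("msg-1", True)     # double-click: same request again
changed = memory.plan("msg-1", False)   # user changes their mind
```

The repeated request yields no work, and the changed request asks for an
untrain before the new train, which is the pair of cases described above.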

> > If this sounds OK, I've a further idea I will expand in email :)

I meant to say "private email", but the list is quiet at the moment
<wink>...

I was thinking that we could possibly abstract the database out one step
more.  Have a single "database manager" that maintains a few 'databases' -
really just discrete tables, with no joins, in standard database parlance.
What I'm trying to get at is that if we could have 2 dictionaries (existing
word dictionary, plus one more "msg_id->how_was_trained") stored in a single
file, and maybe even the possibility of additional "application defined"
dictionaries (such as random config info) in that same file, life would be
pretty peachy :)

If we talk in terms of pickles, imagine:
database['bayes'] = existing_bayes_pickle
database['training'] = dict_I_proposed_above
database['outlook_ui'] = dict_for_outlook_ui_options

And 'database' is pickled.  I see no reason this couldn't also work for
bsddb.  I am proposing that Corpus.py automatically manage the 'bayes' and
'training' keys of the database, but leave others for applications.  Bayes
itself persists the entire database.  Some naming convention would be just
fine too :)
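
A rough sketch of the pickle flavour of this proposal, assuming modern
Python.  DatabaseManager, app_table and RESERVED_KEYS are made-up names for
illustration; nothing here is existing Corpus.py behaviour, and the reserved
key check is just one possible naming convention.

```python
import os
import pickle
import tempfile

# Assumption: these two tables are managed by the core, not by applications.
RESERVED_KEYS = ("bayes", "training")


class DatabaseManager:
    """Several independent dictionaries ("tables") stored in one pickle file."""

    def __init__(self, path):
        self.path = path
        self.tables = {key: {} for key in RESERVED_KEYS}

    def load(self):
        try:
            with open(self.path, "rb") as f:
                self.tables = pickle.load(f)
        except FileNotFoundError:
            pass  # first run: keep the empty tables
        for key in RESERVED_KEYS:
            self.tables.setdefault(key, {})

    def save(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.tables, f)

    def app_table(self, name):
        """Return (creating if needed) an application-defined table."""
        if name in RESERVED_KEYS:
            raise KeyError("%r is managed by the core" % name)
        return self.tables.setdefault(name, {})


# Usage: write two tables, then read them back from the same file.
path = os.path.join(tempfile.mkdtemp(), "spambayes.db")
db = DatabaseManager(path)
db.load()
db.tables["training"]["msg-1"] = True
db.app_table("outlook_ui")["show_toolbar"] = False
db.save()

reloaded = DatabaseManager(path)
reloaded.load()
```

A bsddb-backed variant would presumably keep the same interface and differ
only in load() and save().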

Never-satisfied-ly,

Mark.