[Spambayes] training

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Wed Feb 19 08:23:09 EST 2003

2/19/2003 4:48:36 AM, "Mark Hammond" <mhammond at skippinet.com.au> wrote:

>> > Which, coincidentally, leads us to what I have been advocating
>> > for some time <wink>.
>> :)
>> > The core spambayes code should persist 
>> > the word database as now, but also a basic "message 
>> > database".
>> Do you mean one like pop3proxy's cache?  i.e. one that 
>> expires messages over a certain age?
>I actually just meant a simple msg_id->trained_as_spam dictionary - just a
>memory that a message had previously been trained as ham/spam, so a need to
>untrain and multiple requests for the same message can be detected.  This is
>user-proof in the face of I-double-click-everywhere type users <wink>

This is a great idea.  The filesystem-based stuff (pop3proxy) will need to 
keep a permanent copy of the mails that have been trained in order for this 
to work, but I don't have a problem with that.
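
A minimal sketch of that memory, to make the idea concrete.  The class name, 
method name and the action strings are all made up for illustration; nothing 
like this exists in spambayes today, and the real thing would hand the 
actual train/untrain work to the classifier.

```python
class TrainingRecord:
    """Hypothetical msg_id -> trained_as_spam memory."""

    def __init__(self):
        # msg_id -> True if last trained as spam, False if trained as ham
        self.trained = {}

    def train(self, msg_id, is_spam):
        """Record a training request and say what the caller should do."""
        previous = self.trained.get(msg_id)
        if previous is None:
            action = "train"               # never seen before
        elif previous == is_spam:
            action = "skip"                # duplicate request, e.g. a double click
        else:
            action = "untrain-then-train"  # user changed their mind
        self.trained[msg_id] = is_spam
        return action
```

A second request for the same message with the same label comes back as 
"skip", which is the double-click protection Mark describes.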

>> > If this sounds OK, I've a further idea I will expand in email :)
>I meant to say "private email", but the list is quiet at the moment.
>I was thinking that we could possibly abstract the database out one step
>more.  Have a single "database manager" that maintains a few 'databases' -
>really just discrete tables, with no joins, in standard database parlance.
>What I'm trying to get at is that if we could have 2 dictionaries (existing
>word dictionary, plus one more "msg_id->how_was_trained") stored in a single
>file, and maybe even the possibility of additional "application defined"
>dictionaries (such as random config info) in that same file, life would be
>pretty peachy :)
>If we talk in terms of pickles, imagine:
>database['bayes'] = existing_bayes_pickle
>database['training'] = dict_I_proposed_above
>database['outlook_ui'] = dict_for_outlook_ui_options

We might replace Options.py with a pickled dictionary stored under one of 
these keys, or at least the user-configurable part of it.  The configurator 
for bayescustomize.ini is an enormous pain, and getting worse as I try to 
write 'installers' for various pop3 mailers.

>And 'database' is pickled.  I see no reason this couldn't also work for
>bsddb.  I am proposing that Corpus.py automatically manage the 'bayes' and
>'training' keys of the database, but leave others for applications.  Bayes
>itself persists the entire database.  Some naming convention would be just
>fine too :)
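
Roughly what I picture for the pickle flavour, as a sketch only.  
DatabaseManager, table() and save() are names I invented for this example, 
and it assumes the whole file is small enough to load and rewrite in one go; 
a bsddb version would look different.

```python
import os
import pickle


class DatabaseManager:
    """Hypothetical manager: several named dictionaries in one pickle file."""

    def __init__(self, path):
        self.path = path
        self.tables = {}  # table name -> dict, e.g. 'bayes', 'training'
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.tables = pickle.load(f)

    def table(self, name):
        """Return the named dictionary, creating an empty one if needed."""
        return self.tables.setdefault(name, {})

    def save(self):
        """Pickle every table to a temp file, then swap it into place."""
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(self.tables, f, pickle.HIGHEST_PROTOCOL)
        os.replace(tmp, self.path)
```

Corpus.py would own table('bayes') and table('training'), and an application 
would just ask for its own name, e.g. table('outlook_ui').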

Very kewl ideas.

Getting-over-my-God's-gift-to-opensourcedness-ly, TimS


c'est moi - TimS
