[Spambayes] Supporting new database type in classifier

Brad Clements bkc at murkworks.com
Sat Feb 14 15:01:28 EST 2004


I'm working on a new type of storage that requires closer integration with the classifier's
_getclues, _add_msg, and _remove_msg methods.

For example, this code fragment in classifier._getclues:

            # The all-unigram scheme just scores the tokens as-is.  A Set()
            # is used to weed out duplicates at high speed.
            clues = []
            push = clues.append
            for word in Set(wordstream):
                tup = self._worddistanceget(word)
                if tup[0] >= mindist:
                    push(tup)
            clues.sort()

This loop would essentially be pushed down into the database module. For efficiency, the database
module must have the entire wordstream to work with, rather than one word at a time.


_worddistanceget could be passed into the database as a callback, or the code could
be replicated at the database level. The problem is that _worddistanceget both calls
_wordinfoget and performs the calculations. I'd prefer a function that accepts the token
info (nham, nspam) and does the calculations without being coupled to _wordinfoget.
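
Something like the following is roughly what I have in mind. It is only a sketch: the
name score_counts, its signature, and the default constants s and x are placeholders,
and the arithmetic is a generic smoothed spam-probability estimate, not a copy of what
_worddistanceget actually does.

```python
def score_counts(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Hypothetical helper: score one token from raw counts only.

    spamcount/hamcount are the per-token counts, nspam/nham the number of
    trained messages. s and x are assumed smoothing constants (strength of
    the prior and probability assigned to an unknown word).
    Returns (distance from 0.5, probability). No storage access happens here.
    """
    n = spamcount + hamcount
    if n == 0:
        # Never-seen token: fall back to the prior.
        return abs(x - 0.5), x
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    prob = spamratio / (hamratio + spamratio)
    # Pull the raw estimate toward x when there is little evidence.
    prob = (s * x + n * prob) / (s + n)
    return abs(prob - 0.5), prob
```

Because it only takes numbers, a database backend could call it on rows it has
already fetched in bulk.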

Overriding _wordinfoget in a subclass doesn't work for me, because that function only
gets called with one word at a time.

I could override _getclues, but then I'd have to recreate the bigram logic, which is
quite a lot of code.

So, my first question is, could the bigram stuff be structured as a 'filter' before 
_getclues (modifying the wordstream) and before _add and _remove_msg?
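
To make the 'filter' idea concrete, here is a minimal sketch of a generator that wraps
a wordstream. The name with_bigrams and the "bi:" token prefix are assumptions for
illustration only, and it deliberately ignores whatever selection among overlapping
bigrams the classifier does at scoring time.

```python
def with_bigrams(wordstream):
    """Hypothetical filter: yield each token, plus a bigram token for
    every adjacent pair, without touching storage at all."""
    last = None
    for token in wordstream:
        yield token
        if last is not None:
            # "bi:" prefix is an assumed naming convention.
            yield "bi:%s %s" % (last, token)
        last = token


tokens = list(with_bigrams(["free", "money", "now"]))
```

The same wrapped stream could then be handed unchanged to _getclues, _add_msg, and
_remove_msg.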

Second, what's the best way to restructure classifier so that a storage subclass can 
deal with entire wordstreams in one lump if it so chooses?
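
One possible shape, purely as a sketch: give the storage layer a batch lookup whose
default falls back to per-word lookups, so existing backends keep working and a
database backend can override just that one method. All names here (WordStore,
wordinfo_get_many, get_clues, toy_score) are hypothetical, not existing classifier API.

```python
class WordStore(object):
    """Hypothetical storage base class holding (spamcount, hamcount) per word."""

    def __init__(self, info=None):
        self._info = dict(info or {})

    def wordinfo_get(self, word):
        return self._info.get(word)

    def wordinfo_get_many(self, words):
        # Default: one lookup per word. A database-backed subclass would
        # override this with a single query over the whole set.
        return dict((w, self.wordinfo_get(w)) for w in words)


def toy_score(record):
    # Stand-in scoring function; returns (distance, prob) from raw counts.
    if record is None:
        return 0.0, 0.5
    spamcount, hamcount = record
    prob = spamcount / float(spamcount + hamcount)
    return abs(prob - 0.5), prob


def get_clues(store, wordstream, score=toy_score, mindist=0.1):
    # Hand the storage the whole (de-duplicated) wordstream in one call.
    records = store.wordinfo_get_many(set(wordstream))
    clues = []
    for word, record in records.items():
        dist, prob = score(record)
        if dist >= mindist:
            clues.append((dist, prob, word))
    clues.sort()
    return clues


store = WordStore({"viagra": (9, 1), "python": (1, 9), "the": (5, 5)})
clues = get_clues(store, ["viagra", "the", "python", "the", "unseen"])
```

Here "the" and "unseen" score a distance of 0.0 and are dropped, leaving only the two
strong tokens.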



-- 
Brad Clements,                bkc at murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
http://www.wecanstopspam.org/                   AOL-IM: BKClements



