[Spambayes] Supporting new database type in classifier
bkc at murkworks.com
Sat Feb 14 15:01:28 EST 2004
I'm working on a new type of storage that requires closer integration with the classifier's
_getclues, _add_msg, and _remove_msg methods.
For example, this code fragment in classifier._getclues:
    # The all-unigram scheme just scores the tokens as-is. A Set()
    # is used to weed out duplicates at high speed.
    clues = []
    push = clues.append
    for word in Set(wordstream):
        tup = self._worddistanceget(word)
        if tup[0] >= mindist:
            push(tup)
would essentially be pushed into the database module. For efficiency, the database
module must have the entire wordstream to work with.
_worddistanceget could be passed into the database as a callback, or the code could
be replicated at the database level. That is, _worddistanceget calls _wordinfoget AND
performs calculations. I'd prefer a function that accepts the token info (nham, nspam)
and does the calculations w/o being coupled to _wordinfoget.
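A rough sketch of what such a decoupled function might look like, written for current
Python. The name score_token, its parameters, and the default constants (0.45 and 0.5)
are assumptions for illustration, not the classifier's actual API. The arithmetic is a
ratio-based probability pulled toward a prior for rarely seen tokens; treat it as a
stand-in for whatever _worddistanceget really computes.

```python
def score_token(word, hamcount, spamcount, total_ham, total_spam,
                unknown_word_strength=0.45, unknown_word_prob=0.5):
    """Score one token from its counts alone; no storage lookup involved.

    hamcount/spamcount: messages of each kind containing the token.
    total_ham/total_spam: messages of each kind trained on.
    Returns (distance, prob, word), where distance is |prob - 0.5|.
    """
    n = hamcount + spamcount
    if n == 0:
        # Never-seen token: fall back to the prior (assumed default).
        prob = unknown_word_prob
    else:
        # Guard against division by zero before any training has happened.
        hamratio = hamcount / float(total_ham or 1)
        spamratio = spamcount / float(total_spam or 1)
        prob = spamratio / (hamratio + spamratio)
        # Pull the estimate toward the prior when there is little evidence.
        s = unknown_word_strength
        prob = (s * unknown_word_prob + n * prob) / (s + n)
    return abs(prob - 0.5), prob, word
```

A database module could then fetch the counts for a whole wordstream however it likes
and call this once per row, or be handed the function as a callback.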
Overriding _wordinfoget in a subclass doesn't work for me, because that function only
gets called with one word at a time.
I could override _getclues, but then I'd have to recreate the bigram stuff which is quite a
So, my first question is, could the bigram stuff be structured as a 'filter' that
modifies the wordstream before _getclues, _add_msg and _remove_msg see it?
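To illustrate the 'filter' idea, here is a minimal sketch of a generator that wraps a
wordstream and adds a bigram token for each adjacent pair. The name add_bigrams and the
"bi:" prefix are assumptions, not necessarily what the classifier does today. Note that
this only covers token generation; if the bigram scheme also picks among overlapping
tokens by score, that part would still have to live in or after _getclues.

```python
def add_bigrams(wordstream):
    """Yield every token unchanged, plus 'bi:<prev> <cur>' for each adjacent pair."""
    last = None
    for token in wordstream:
        yield token
        if last is not None:
            yield "bi:%s %s" % (last, token)
        last = token


# The filter is applied before the stream reaches scoring or training code.
tokens = list(add_bigrams(["free", "money", "now"]))
```

Because it is a generator, the same wrapper could sit in front of training and scoring
alike without either of them knowing about bigrams.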
Second, what's the best way to restructure classifier so that a storage subclass can
deal with entire wordstreams in one lump if it so chooses?
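One possible shape for this, as a sketch only: give the classifier a bulk lookup hook
whose default implementation loops over _wordinfoget, so existing storage classes keep
working while a database-backed subclass can override the hook with a single query.
The names BaseStore, BulkStore, _wordinfoget_many and lookup_stream are hypothetical,
and the dict stands in for a real database.

```python
class BaseStore:
    """Stand-in for the classifier's storage side (hypothetical)."""

    def __init__(self, records):
        self._records = dict(records)
        self.lookups = 0  # counts simulated round trips to storage

    def _wordinfoget(self, word):
        self.lookups += 1
        return self._records.get(word)

    def _wordinfoget_many(self, words):
        # Default: one lookup per word, so old subclasses need no changes.
        return {word: self._wordinfoget(word) for word in words}


class BulkStore(BaseStore):
    """A storage subclass that handles the whole batch in one round trip."""

    def _wordinfoget_many(self, words):
        self.lookups += 1  # e.g. a single SELECT ... WHERE word IN (...)
        return {word: self._records.get(word) for word in words}


def lookup_stream(store, wordstream):
    """What _getclues might do: dedupe the stream, then ask for all records at once."""
    return store._wordinfoget_many(set(wordstream))


records = {"free": (1, 9), "hello": (8, 2)}
base, bulk = BaseStore(records), BulkStore(records)
base_result = lookup_stream(base, ["free", "hello", "free", "xyzzy"])
bulk_result = lookup_stream(bulk, ["free", "hello", "free", "xyzzy"])
```

Both stores return the same mapping; only the number of round trips differs. Training
(_add_msg and _remove_msg) could be given a similar batch hook.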
Brad Clements, bkc at murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
http://www.wecanstopspam.org/ AOL-IM: BKClements