[Spambayes] Supporting new database type in classifier
tim.one at comcast.net
Sat Feb 14 23:25:28 EST 2004
> I'm working on a new type of storage that requires closer
> integration with classifier _getclues and _add_msg, _remove_msg.
You'll probably get better responses on the spambayes-dev list.
> For example, this code fragment in classifier._getclues:
>     # The all-unigram scheme just scores the tokens as-is.  A Set()
>     # is used to weed out duplicates at high speed.
>     clues = []
>     push = clues.append
>     for word in Set(wordstream):
>         tup = self._worddistanceget(word)
>         if tup[0] >= mindist:
> Would essentially be pushed into the database module. For
> efficiency, the database module must have the entire wordstream
> to work with.
I encourage you to work on a branch for now -- since most people drop most
ideas after a few weeks at most, I'm opposed to warping this part of the
code to cater to something as unlikely to be seen again as a
non-random-access database model. If you work on a branch and demonstrate
astonishing results, great, then we'll junk all other storages and adopt
> _worddistanceget could be passed into the database as a callback,
> or the code could be replicated at the database level. That is,
> _worddistanceget calls _wordinfoget AND performs calculations. I'd
> prefer a function that accepts the token info (nham, nspam)
> and does the calculations w/o being coupled to _wordinfoget.
> Overriding _wordinfoget in a subclass doesn't work for me, because
> that function only gets called with one word at a time.
> I could override _getclues, but then I'd have to recreate the
> bigram stuff which is quite a lot.
It's less than 30 lines of code (half of it is comments).
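
For illustration, the decoupled calculation asked for above might look roughly like the sketch below: a plain function that takes a token's counts and the corpus totals and returns a probability, with no call to _wordinfoget. The names word_probability and word_distance are hypothetical, as are the parameter names (spamcount/hamcount for the per-token counts, nspam/nham for the message totals). The default strength and prior values are assumptions standing in for whatever the classifier options specify, and the formula is the usual Robinson-style adjustment rather than a copy of the classifier's own code.

```python
def word_probability(spamcount, hamcount, nspam, nham,
                     unknown_word_strength=0.45, unknown_word_prob=0.5):
    """Return an adjusted spam probability for one token.

    spamcount/hamcount: messages of each kind containing the token.
    nspam/nham: total messages of each kind trained on.
    The two keyword defaults are assumed values, not read from any options.
    """
    # Guard against division by zero on an untrained corpus.
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0

    # A token never seen in either corpus gets the prior.
    if spamratio + hamratio == 0:
        return unknown_word_prob

    prob = spamratio / (hamratio + spamratio)

    # Pull the raw estimate toward the prior; the less evidence (small n),
    # the stronger the pull.
    n = spamcount + hamcount
    s = unknown_word_strength
    x = unknown_word_prob
    return (s * x + n * prob) / (s + n)


def word_distance(prob):
    """Distance of a probability from the neutral value 0.5."""
    return abs(prob - 0.5)
```

Because the function takes only numbers, a storage backend could call it for each record it reads, in whatever order it reads them.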
> So, my first question is, could the bigram stuff be structured as a
> 'filter' before _getclues (modifying the wordstream) and before
> _add and _remove_msg?
The bigram stuff is already a filter before _add and _remove. It could also
be done as a filter before _getclues, but not pleasantly.
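
As a rough picture of what "a filter" over the wordstream means on the training side, here is a minimal sketch of a generator that passes every token through and also emits a token for each adjacent pair. The function name with_bigrams and the "bi:" prefix are assumptions for illustration. The sketch covers token generation only; it does not attempt the scoring side, which is presumably where the unpleasantness lies.

```python
def with_bigrams(wordstream):
    """Yield each token, plus one synthetic token per adjacent pair.

    Hypothetical helper: the 'bi:' prefix is an assumed naming convention.
    """
    previous = None
    for token in wordstream:
        yield token
        if previous is not None:
            # Pair the current token with the one just before it.
            yield "bi:%s %s" % (previous, token)
        previous = token


# Usage: wrap the token stream before handing it to training code.
tokens = list(with_bigrams(["buy", "cheap", "pills"]))
```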
> Second, what's the best way to restructure classifier so that a
> storage subclass can deal with entire wordstreams in one lump if
> it so chooses?
On a branch -- prove this is worth doing first, and don't worry about doing
it cleanly before that succeeds.
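
One way such a branch experiment could start is sketched below, under stated assumptions: the storage exposes a single bulk lookup, and clue gathering is a free function that hands it the whole de-duplicated wordstream at once. BatchStore, lookup_many, get_clues, and the score callback are all hypothetical names, not part of the existing classifier, and the 0.1 default for mindist is an assumed value.

```python
class BatchStore:
    """Hypothetical storage that answers lookups for many tokens at once."""

    def __init__(self, records):
        # token -> (spamcount, hamcount); an in-memory stand-in for a
        # backend that prefers one bulk pass over many random reads.
        self._records = dict(records)

    def lookup_many(self, tokens):
        """Return {token: record or None} for every requested token."""
        return {token: self._records.get(token) for token in tokens}


def get_clues(store, wordstream, score, mindist=0.1):
    """Collect (distance, prob, token) for tokens far enough from 0.5.

    score is a callback mapping a record (or None) to a probability.
    """
    # De-duplicate first, then fetch everything in a single call.
    records = store.lookup_many(set(wordstream))
    clues = []
    for token, record in records.items():
        prob = score(record)
        distance = abs(prob - 0.5)
        if distance >= mindist:
            clues.append((distance, prob, token))
    clues.sort()
    return clues


# Usage with a toy scoring rule (raw spam fraction, 0.5 if unknown).
def toy_score(record):
    if record is None:
        return 0.5
    spamcount, hamcount = record
    return spamcount / (spamcount + hamcount)

store = BatchStore({"viagra": (9, 1), "meeting": (1, 9), "the": (5, 5)})
clues = get_clues(store, ["viagra", "the", "viagra", "unseen"], toy_score)
```

Keeping the scoring rule as a callback leaves the probability calculation out of the storage layer, so only the lookup strategy changes.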