[Spambayes] spamprob combining

T. Alexander Popiel popiel@wolfskeep.com
Thu, 10 Oct 2002 12:13:49 -0700


In message:  <LNBBLJKPBEHFEDALKOLCOEIKBKAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>> I'll run this one after I get done with my initial clt tests
>> (which are taking about 4.5 hours each :-/ ).
>
>Use less data?

Yes, I could go back to using only 5 sets instead of 10... but then
my results would be a bit less comparable with other runs I've done.

>> I can't really say anything else, yet, but clt seems _much_ slower
>> than the default classifier.
>
>I haven't really noticed that.  If you're using your "--trainstyle full"
>patch with timcv, then, yes, it would be enormously slower -- timcv gets
>enormous *efficiency* benefits (both instruction-count and temporal cache
>locality) out of incremental learning and unlearning.
>
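
Right; that makes sense.  For the archives, my mental model of the
timcv trick is roughly this (a sketch only; cross_validate, learn,
unlearn, and spamprob are made-up stand-ins, not the real timcv code):

    def cross_validate(classifier, folds):
        # One full training pass over all n sets up front.
        for fold in folds:
            for msg, is_spam in fold:
                classifier.learn(msg, is_spam)

        results = []
        for fold in folds:
            # Forget just this fold (1/n of the data)...
            for msg, is_spam in fold:
                classifier.unlearn(msg, is_spam)
            # ...score it against the remaining n-1 folds...
            for msg, is_spam in fold:
                results.append((classifier.spamprob(msg), is_spam))
            # ...and learn it back, restoring the full training set.
            # Each round touches only 1/n of the messages, three times
            # in a row (hence the cache locality), instead of
            # retraining on (n-1)/n of the data from scratch per fold,
            # which is what my --trainstyle full patch forces.
            for msg, is_spam in fold:
                classifier.learn(msg, is_spam)
        return results
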
>The "third training pass" unique to the clt methods also doubles the
>training time (each msg in the training data is tokenized once to update the
>wordprobs, and then a second time to compute the clt ham and spam population
>statistics).
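
For anyone else doing the cost accounting, I read that as roughly the
following (again just a sketch; clt_train, mean_var, and the attribute
names are my inventions, and tokenize stands in for the tokenizer
module's tokenize):

    from tokenizer import tokenize   # assuming spambayes' tokenizer

    def mean_var(xs):
        # Population mean and variance of a list of per-message scores.
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / len(xs)

    def clt_train(classifier, msgs):
        # First tokenization: update the word counts, then recompute
        # the wordprobs.
        for msg, is_spam in msgs:
            classifier.learn(tokenize(msg), is_spam)
        classifier.update_probabilities()

        # Second tokenization (the "third training pass"): score every
        # training message against the now-final wordprobs and
        # accumulate the ham and spam population statistics the
        # central-limit machinery needs.
        ham, spam = [], []
        for msg, is_spam in msgs:
            score = classifier.spamprob(tokenize(msg))
            (spam if is_spam else ham).append(score)
        classifier.ham_mean, classifier.ham_var = mean_var(ham)
        classifier.spam_mean, classifier.spam_var = mean_var(spam)

So every training message pays the tokenizer twice, which accounts for
the doubling.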

Is it worth caching the token streams somehow?  (I'm thinking not,
since this is still in the research-project stage...)
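
If it ever did seem worthwhile, the dumbest version I can think of is
a dict keyed on the raw message text (again a sketch; cached_tokenize
and _token_cache are inventions, with tokenize standing in for the
tokenizer module's tokenize):

    from tokenizer import tokenize   # assuming spambayes' tokenizer

    _token_cache = {}

    def cached_tokenize(msg_text):
        # tokenize() is a generator, so materialize its output the
        # first time through; the second (clt statistics) pass can
        # then reuse the list instead of re-tokenizing.
        toks = _token_cache.get(msg_text)
        if toks is None:
            toks = _token_cache[msg_text] = list(tokenize(msg_text))
        return toks

Of course, holding complete token lists for ten sets of messages in
core is exactly what a 64M box can't afford, which brings me to: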

Quite possibly the problem is that I'm running all this on a PII-300
with only 64M of RAM, which is also running X (but not Gnome or KDE;
I'm a hardened twm user!)...

- Alex