[Spambayes] More HTML strippage.
Tim Peters
tim.one@comcast.net
Sun, 29 Sep 2002 00:45:28 -0400
[Neil Schemenauer]
>> I think it would be better to generate:
>>
>> header:num_recip_1
>> header:num_recip_2
>> header:num_recip_3
>> header:num_recip_4
>> ...
>> header:num_recip_21
>>
>> We should probably do the same thing when counting headers. I'll give
>> it a try.
> Not so good. This seems to work well, though:
>
> to:2**4
>
> IOW, log2(n).
Cool! Thank you. I'm running tests now (it will take a while to finish).
> The idea is that going from 8 to 16 is about the same as
> going from 1 to 2.
Makes good sense to me! I had wondered in the past, but not pursued,
whether a log gimmick would have been better in Graham's "count multiple
words in a msg multiple times during training" scheme too.
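For concreteness, here is a rough sketch of the log2 bucketing Neil describes, where a raw count is collapsed into a token like to:2**4. The function name count_token and the choice to round (rather than truncate) the logarithm are assumptions made for illustration, not necessarily what the checked-in tokenizer does.

```python
import math

def count_token(prefix, n):
    """Build a token like 'to:2**4' from a count n (hypothetical helper).

    Counts are bucketed by the nearest power of two, so doubling the
    count always moves the token by one step, whether that is going
    from 1 to 2 or from 8 to 16.
    """
    if n < 1:
        raise ValueError("count must be at least 1")
    # Assumption: round to the nearest integer exponent.
    exponent = round(math.log2(n))
    return "%s:2**%d" % (prefix, exponent)

if __name__ == "__main__":
    for n in (1, 2, 8, 16, 21):
        print(n, count_token("to", n))
```

Under these assumptions, anything from 12 through 22 recipients yields the same token, to:2**4, so the classifier sees a couple dozen distinct tokens at most instead of one per exact count.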
> I wonder if we should do this for 'skip:' as well.
As the comments say <wink>, I have no idea what skip accomplishes, just that
it improves error rates, and that every variation I've ever tried did worse.
I haven't tried a logarithmic version, though. I suspect that whenever skip
helps, it's exposing a systematic weakness of the tokenizer, but it's very
time-consuming to analyze things at that level. Feel encouraged to
experiment! I don't like skip, but the data says "it works" (or, more
accurately, that it helped the last time I ran a controlled experiment
changing just it).