[Spambayes] More HTML strippage.

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 00:45:28 -0400


[Neil Schemenauer]
>> I think it would be better to generate:
>>
>>     header:num_recip_1
>>     header:num_recip_2
>>     header:num_recip_3
>>     header:num_recip_4
>>     ...
>>     header:num_recip_21
>>
>> We should probably do the same thing when counting headers.  I'll give
>> it a try.


> Not so good.  This seems to work well, though:
>
>     to:2**4
>
> IOW, log2(n).

Cool!  Thank you.  I'm running tests now (it will take a while to finish).

> The idea is that going from 8 to 16 is about the same as
> going from 1 to 2.

Makes good sense to me!  I had wondered in the past, but not pursued,
whether a log gimmick would have been better in Graham's "count multiple
words in a msg multiple times during training" scheme too.
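
For concreteness, here is a minimal sketch of the log2 bucketing discussed
above.  The function name `count_token` and the choice to round to the
nearest exponent are assumptions made for illustration; the thread itself
only gives the token shape `to:2**4` and "log2(n)".

```python
import math

def count_token(prefix, n):
    """Map a positive count to a log2-bucketed token such as 'to:2**4'.

    Rounding to the nearest exponent is an assumption; the thread only
    says "log2(n)".
    """
    if n < 1:
        raise ValueError("count must be at least 1")
    exponent = round(math.log2(n))
    return "%s:2**%d" % (prefix, exponent)

# A few recipient counts and the token each one would produce.
for n in (1, 2, 8, 16, 21):
    print(n, count_token("to", n))
```

Under that rounding assumption, every count from 12 through 22 produces the
same token, `to:2**4`, so the classifier sees one feature per doubling
rather than one feature per distinct count.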

> I wonder if we should do this for 'skip:' as well.

As the comments say <wink>, I have no idea what skip accomplishes, just that
it improves error rates, and that every variation I've ever tried did worse.
I haven't tried a logarithmic version, though.  I suspect that whenever skip
helps, it's exposing a systematic weakness of the tokenizer, but it's very
time-consuming to analyze things at that level.  Feel encouraged to
experiment!  I don't like skip, but the data says "it works" (or, more
accurately, that it helped the last time I ran a controlled experiment
changing just it).