[spambayes-dev] empty urls in bigram?
Tim Peters
tim.one at comcast.net
Wed Dec 17 18:43:45 EST 2003
[Skip]
> I just noticed this bigram in my clues: 'bi:url: url:'. If 'url:'
> would only be presented once as a clue, does it make sense to form a
> bigram with two instances of it?
Sure -- why not? The same thing might happen to "really really" in
    The only product that makes your toes really really big!
Since repetition is a form of advertising hyperbole (FREE FREE FREE!), I
like the chance to catch it this way. You could try removing the
possibility and running large-scale tests both ways, but I think there are
more basic questions about the unibi approach open now. Note that we
*won't* score more than one instance of "really really" per message --
bigram clues are subjected to the same duplicate-squashing as unigram clues.
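
A minimal sketch of the point above, not the actual spambayes code: the
function name unibi_clues is hypothetical, and the real unibi scheme may do
more than this. It only shows adjacent tokens being paired into 'bi:' clues
and duplicates being squashed by collecting everything into a set.

```python
def unibi_clues(tokens):
    """Return the distinct unigram and bigram clues for a token list.

    Hypothetical helper for illustration only. Bigrams use the
    'bi:<first> <second>' form seen in the clue quoted above.
    """
    tokens = list(tokens)
    # Pair each token with its successor to form the bigram clues.
    bigrams = ["bi:%s %s" % (a, b) for a, b in zip(tokens, tokens[1:])]
    # A set keeps one instance of each clue, however often it occurred.
    return set(tokens) | set(bigrams)


if __name__ == "__main__":
    msg = "really really big really really".split()
    for clue in sorted(unibi_clues(msg)):
        print(clue)
```

Here "really really" occurs twice in the token stream, but the clue
'bi:really really' comes out only once.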
> What does an empty "url:" token mean?
It doesn't *mean* anything <wink>. Staring at the code, looks like it's
produced if and only if a URL contains two adjacent characters from this
set:
    ;?:@&=+,$.
So 'bi:url: url:' would come from three adjacent characters in that set
(two empty pieces in a row).
Sounds spammy to me.
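
A rough sketch of how that could happen, assuming the tokenizer splits each
URL piece on the character set above and prefixes every chunk with 'url:'.
The names SEP_RE and url_clues are made up for illustration; this is not
copied from the spambayes tokenizer.

```python
import re

# The separator set quoted above. Inside a character class, every one
# of these characters is matched literally.
SEP_RE = re.compile(r"[;?:@&=+,$.]")


def url_clues(piece):
    """Split one URL piece on the separators and prefix each chunk.

    Hypothetical helper: adjacent separators leave an empty chunk
    between them, which becomes a bare 'url:' token.
    """
    return ["url:" + chunk for chunk in SEP_RE.split(piece)]


if __name__ == "__main__":
    print(url_clues("a.b"))    # ['url:a', 'url:b']
    print(url_clues("a..b"))   # ['url:a', 'url:', 'url:b']
    print(url_clues("a...b"))  # ['url:a', 'url:', 'url:', 'url:b']
```

Under that assumption, two adjacent separators give one empty 'url:' token,
and three give two in a row, which is what it takes to pair up into
'bi:url: url:'.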