[spambayes-dev] empty urls in bigram?

Tim Peters tim.one at comcast.net
Wed Dec 17 18:43:45 EST 2003


[Skip]
> I just noticed this bigram in my clues: 'bi:url: url:'.  If 'url:'
> would only be presented once as a clue, does it make sense to form a
> bigram with two instances of it?

Sure -- why not?  The same thing might happen to "really really" in

    The only product that makes your toes really really big!

Since repetition is a form of advertising hyperbole (FREE FREE FREE!), I
like the chance to catch it this way.  You could try removing the
possibility and running large-scale tests both ways, but I think there are
more basic questions about the unibi approach open now.  Note that we
*won't* score more than one instance of "really really" per message --
bigram clues are subjected to the same duplicate-squashing as unigram clues.
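
As a rough illustration of that squashing (a minimal sketch, not the actual
SpamBayes code: the function name unigrams_and_bigrams and the use of a plain
set are assumptions made for the example), each token is emitted along with a
'bi:'-prefixed pair for every two adjacent tokens, and duplicates collapse
before scoring:

```python
def unigrams_and_bigrams(tokens):
    """Yield each token, plus a 'bi:' clue for every adjacent pair.

    Hypothetical helper for illustration only; the real tokenizer differs.
    """
    prev = None
    for tok in tokens:
        yield tok
        if prev is not None:
            yield "bi:%s %s" % (prev, tok)
        prev = tok


words = "makes your toes really really big really really".split()
raw = list(unigrams_and_bigrams(words))

# Duplicate-squashing: a set keeps at most one instance of each clue,
# so the repeated bigram can only be scored once per message.
clues = set(raw)
```

Here the raw stream contains 'bi:really really' twice, but the set of clues
holds it only once.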


> What does an empty "url:" token mean?

It doesn't *mean* anything <wink>.  Staring at the code, it looks like it's
produced if and only if a URL contains two adjacent characters from this
set:

    ;?:@&=+,$.

So 'bi:url: url:' would come from three adjacent characters in that set.
Sounds spammy to me.
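
The effect can be reproduced with a small sketch, assuming the URL pieces are
split on that character set with a regular expression and each piece is
prefixed with 'url:' (the names URL_SEP and url_tokens are made up for the
example and are not taken from the tokenizer):

```python
import re

# Assumed separator set, copied from the characters listed above.
# Inside a character class, '$' and '.' are literal.
URL_SEP = re.compile(r"[;?:@&=+,$.]")


def url_tokens(piece):
    """Split one URL piece on the separator set and prefix each chunk.

    Hypothetical helper; adjacent separators leave empty chunks behind,
    which become bare 'url:' tokens.
    """
    return ["url:" + chunk for chunk in URL_SEP.split(piece)]


two_adjacent = url_tokens("x..y")     # one empty chunk
three_adjacent = url_tokens("x?&=y")  # two empty chunks in a row
```

With two adjacent separators there is a single bare 'url:' token; with three
there are two bare 'url:' tokens side by side, which is what a bigram
generator would turn into 'bi:url: url:'.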



