[spambayes-dev] A URL experiment
Tim Peters
tim.one at comcast.net
Tue Dec 30 22:46:36 EST 2003
[Tony Meyer, tries the patches]
Thanks, Tony!
> ...
> Rather like Tim's results, really, at least to my ignorant eyes.
The results are both as weakly positive as things get, but at least neither
patch is doing any harm. As before, I'd rather see Skip try to deal with %
escapes the way my patch did -- that's a common obfuscation trick, and I bet
it accounts for the small reduction in Unsures you saw. My patch should do
a lot more to penalize that trick than Skip's.
Both patches tokenize the de-obfuscated URL, so they're a wash in that
respect.
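
To make the % escape discussion concrete, here is a minimal sketch, not either patch's actual code. It assumes one clue is generated per % escape found in the raw URL, and that the decoded URL is then split into word tokens. The function name `url_escape_tokens` and the `url:` token prefix are hypothetical, chosen only for illustration.

```python
import re
from urllib.parse import unquote

# Matches a single %XX escape and captures the two hex digits.
ESCAPE_RE = re.compile(r"%([0-9A-Fa-f]{2})")


def url_escape_tokens(url):
    """Return illustrative tokens for a URL (hypothetical token names)."""
    tokens = []

    # One token per % escape seen in the raw URL, so the classifier
    # can learn whether escaped characters are a spam or ham clue.
    for hexpair in ESCAPE_RE.findall(url):
        tokens.append("url:%" + hexpair.lower())

    # Decode the escapes, then tokenize the de-obfuscated URL by
    # splitting on runs of non-alphanumeric characters.
    decoded = unquote(url).lower()
    for piece in re.split(r"[^a-z0-9]+", decoded):
        if piece:
            tokens.append("url:" + piece)

    return tokens
```

With this sketch, an obfuscated host such as `www.%65xample.com` yields both a token for the escape itself and the same word tokens the plain spelling would produce.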
Skip's patch also exposes higher-level concepts to the classifier, like
"non-standard port number". I don't see that often, but when I do it's
usually in email from my work account (e.g., trying to get me to preview a
pre-release site change, accessed via a non-standard port so it doesn't
interfere with the production site). That's OK, though: *my* classifier
will learn that's a ham clue in my email mix -- so it goes.
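
A "non-standard port number" clue of the kind described above could be sketched as follows. This is an assumption about how such a token might be produced, not Skip's code: the function name `port_tokens`, the `DEFAULT_PORTS` table, and the token string are all hypothetical.

```python
from urllib.parse import urlsplit

# Assumed defaults for the schemes of interest (illustrative only).
DEFAULT_PORTS = {"http": 80, "https": 443}


def port_tokens(url):
    """Return a token when the URL names an explicit, non-default port."""
    parts = urlsplit(url)
    try:
        port = parts.port  # None when the URL gives no explicit port
    except ValueError:
        # Malformed or out-of-range port: emit nothing in this sketch.
        return []

    if port is None or DEFAULT_PORTS.get(parts.scheme) == port:
        return []
    return ["url:non-standard-port"]
```

Whether such a token counts toward spam or ham is left entirely to training, which is why the same clue can end up as a ham indicator in one person's mail and a spam indicator in another's.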
Since everyone is getting some good out of Skip's changes (and I don't think
his treatment of % escapes is making a difference), and also getting some
good out of mine (which don't try to do anything except get some good out of %
escapes), combining the two will do better than either, or cancel each other
out <0.5 wink>.