[spambayes-dev] Deprecated options
kennypitt at hotmail.com
Tue Aug 3 22:58:14 CEST 2004
Matthew Dixon Cowles wrote:
> Cross-validation showed that x-pick_apart_urls wasn't particularly
> effective on my mail. But creating synthetic tokens for the IP
> address of a URL's host part is effective for me and that's a small
> hack on top of the code that implements x-pick_apart_urls.
I assume you're doing a DNS lookup on the hostname in the URL, then?
The potential problem I see with that is that the IP address can change if
you look up the same hostname again. The host could be switched to a
different IP, or they might be using round-robin DNS for load balancing.
SpamBayes training relies on the tokenizer generating the exact same set of
tokens every time a message is parsed. If you make a mistake in training
and later correct that mistake, SpamBayes needs to remove the trained tokens
from the incorrect corpus and it does that by tokenizing the message again.
If you introduce dynamic information into the token stream, you might end up
trying to remove a different set of tokens than what was originally added,
which can potentially corrupt your training database.
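To make the failure mode concrete, here is a rough sketch. It is not SpamBayes code: ToyCorpus, tokenize, the "url-ip:" token prefix and the two fake lookup tables are all made up for illustration, and the DNS answers are simulated with dictionaries instead of real lookups. It only shows what happens to per-token counts when the message is tokenized differently at untrain time than it was at train time.

```python
from collections import Counter


class ToyCorpus:
    """Hypothetical stand-in for a training database: token -> count."""

    def __init__(self):
        self.counts = Counter()

    def train(self, tokens):
        for tok in tokens:
            self.counts[tok] += 1

    def untrain(self, tokens):
        for tok in tokens:
            self.counts[tok] -= 1
            if self.counts[tok] == 0:
                del self.counts[tok]


def tokenize(message, resolve):
    """Split on whitespace; add a synthetic IP token for each URL host.

    `resolve` is an assumed callable mapping hostname -> IP string.
    """
    tokens = []
    for word in message.split():
        tokens.append(word)
        if word.startswith("http://"):
            host = word[len("http://"):].split("/")[0]
            tokens.append("url-ip:" + resolve(host))
    return tokens


message = "buy now http://example.com/offer"

# Simulated DNS answers: the same hostname resolves differently later on.
first_lookup = {"example.com": "192.0.2.10"}.get
second_lookup = {"example.com": "192.0.2.20"}.get

spam = ToyCorpus()
spam.train(tokenize(message, first_lookup))     # mistaken training
spam.untrain(tokenize(message, second_lookup))  # later correction

leftover = dict(spam.counts)
print(leftover)
```

After the correction the toy corpus should be empty, but the token for the first address is left behind with a count of 1 and the token for the second address ends up at -1. Untraining with the same lookup that was used for training leaves nothing behind.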