[spambayes-dev] Deprecated options
Matthew Dixon Cowles
matt at mondoinfo.com
Wed Aug 4 00:17:16 CEST 2004
>> But creating synthetic tokens for the IP address of a URL's host
>> part is effective for me and that's a small
>> hack on top of the code that implements x-pick_apart_urls.
> I assume you're doing a DNS lookup on the hostname in the url, then?
Yes, exactly so.
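
For anyone who wants to picture it, here is a rough sketch of the idea
rather than the actual hack. The function names and the token format
(one token per address prefix) are made up for illustration, and the
resolver is passed in so the sketch can be tried without touching DNS.

```python
import socket
from urllib.parse import urlsplit


def resolve_ipv4(host):
    """Return every IPv4 address the resolver reports for host."""
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        # Lookup failures simply produce no IP tokens.
        return []


def url_ip_tokens(url, resolve=resolve_ipv4):
    """Yield synthetic tokens for each IP address of the URL's host.

    Hypothetical token format: one token per prefix length
    (/8, /16, /24, /32), so hosts in the same network share tokens.
    """
    host = urlsplit(url).hostname
    if not host:
        return
    for ip in sorted(set(resolve(host))):
        octets = ip.split(".")
        for n in range(1, 5):
            yield "url-ip:%s/%d" % (".".join(octets[:n]), n * 8)
```

Every address the lookup returns gets its own tokens, so a round-robin
answer that lists several addresses at once is covered in one pass.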
> The potential problem I see with that is that the IP address can
> change if you lookup the same hostname again. The system could be
> switched to a different IP, or they might be using a round-robin
> DNS for load balancing.
The code creates a token (actually multiple tokens) for each IP, so
the round-robin aspect doesn't apply, but everything else you say is
> SpamBayes training relies on the tokenizer generating the exact
> same set of tokens every time a message is parsed. If you make a
> mistake in training and later correct that mistake, SpamBayes needs
> to remove the trained tokens from the incorrect corpus and it does
> that by tokenizing the message again. If you introduce dynamic
> information into the token stream, you might end up trying to
> remove a different set of tokens than what was originally added,
> which can potentially corrupt your training database.
I agree. Nevertheless, I'm glad to use it because it works. Between
mine_received_headers and using tokens from the URLs' IPs, even pure
word-salad rarely gets through.
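
To make the untraining concern concrete, here is a toy illustration.
ToyCorpus and tokenize are invented for this example and are not the
SpamBayes classifier or tokenizer; the two lambdas stand in for DNS
lookups done at different times.

```python
from collections import Counter


class ToyCorpus:
    """Toy per-token message counts; not the SpamBayes API."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, tokens):
        for tok in set(tokens):
            self.counts[tok] += 1

    def unlearn(self, tokens):
        for tok in set(tokens):
            self.counts[tok] -= 1


def tokenize(msg, resolve):
    """Split on whitespace and add one IP token per resolved URL address."""
    words = msg.split()
    ip_tokens = ["url-ip:" + ip
                 for w in words if w.startswith("http://")
                 for ip in resolve(w)]
    return words + ip_tokens


msg = "buy now http://example.com/"
spam = ToyCorpus()
# Lookup result at training time.
spam.learn(tokenize(msg, lambda url: ["192.0.2.7"]))
# The host has moved by the time the mistake is corrected.
spam.unlearn(tokenize(msg, lambda url: ["198.51.100.3"]))
# Tokens whose counts did not return to zero.
leftovers = {tok: n for tok, n in spam.counts.items() if n != 0}
print(leftovers)
```

The word tokens cancel out, but the old IP token stays at a count of 1
and the new one drops to -1, which is the mismatch described above.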