[spambayes-dev] Deprecated options

Matthew Dixon Cowles matt at mondoinfo.com
Wed Aug 4 00:17:16 CEST 2004


[me]
>> But creating synthetic tokens for the IP address of a URL's host
>> part is effective for me, and that's a small hack on top of the
>> code that implements x-pick_apart_urls.

[Kenny Pitt]
> I assume you're doing a DNS lookup on the hostname in the url, then?

Yes, exactly so.

> The potential problem I see with that is that the IP address can
> change if you look up the same hostname again.  The system could be
> switched to a different IP, or they might be using a round-robin
> DNS for load balancing.

The code creates a token (actually multiple tokens) for each IP, so
the round-robin aspect doesn't apply, but everything else you say is
quite correct.
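
For anyone curious, a minimal sketch of the idea might look like the
following. It is not the actual patch: the function names, the
"url-ip" token prefix, and the choice of one token per /8, /16, /24
and /32 prefix of each address are all assumptions made for
illustration, and it only handles IPv4.

```python
import socket
from urllib.parse import urlsplit


def resolve_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return every IPv4 address the hostname resolves to, sorted."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # Lookup failures (socket.gaierror, socket.herror) yield no tokens.
        return []
    return sorted(set(addresses))


def ip_tokens(ip, prefix="url-ip"):
    """Hypothetical token scheme: one token per leading octet group."""
    octets = ip.split(".")
    return ["%s:%s/%d" % (prefix, ".".join(octets[:n]), 8 * n)
            for n in range(1, 5)]


def url_ip_tokens(url, resolver=socket.gethostbyname_ex):
    """Generate synthetic tokens for every IP behind a URL's host part."""
    host = urlsplit(url).hostname
    if not host:
        return []
    tokens = []
    for ip in resolve_ips(host, resolver):
        tokens.extend(ip_tokens(ip))
    return tokens
```

Because every address returned by the lookup gets its own tokens, a
round-robin host contributes tokens for all of its addresses at once
rather than for whichever one happens to come first.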

> SpamBayes training relies on the tokenizer generating the exact
> same set of tokens every time a message is parsed.  If you make a
> mistake in training and later correct that mistake, SpamBayes needs
> to remove the trained tokens from the incorrect corpus and it does
> that by tokenizing the message again. If you introduce dynamic
> information into the token stream, you might end up trying to
> remove a different set of tokens than what was originally added,
> which can potentially corrupt your training database.
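
The failure mode described above can be shown with a toy token
database. This is only a sketch: the train and untrain helpers and
the token strings are made up for illustration and are not the
SpamBayes classifier API.

```python
from collections import Counter


def train(counts, tokens):
    """Record one message's worth of tokens (each token counted once)."""
    for token in set(tokens):
        counts[token] += 1


def untrain(counts, tokens):
    """Remove one message's worth of tokens, dropping entries that hit zero."""
    for token in set(tokens):
        if counts[token] > 1:
            counts[token] -= 1
        else:
            counts.pop(token, None)


counts = Counter()

# Tokens produced when the message was first (mis)trained.
first_pass = ["word:offer", "url-ip:192.0.2.7/32"]
train(counts, first_pass)

# Tokens produced when the same message is tokenized again later,
# after the URL's host has moved to a different address.
second_pass = ["word:offer", "url-ip:198.51.100.9/32"]
untrain(counts, second_pass)

# The old IP token was never removed and is left behind.
leftover = dict(counts)
```

After the untrain step, leftover still holds a count for the
original address, even though the message that produced it is no
longer supposed to be in the corpus.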

I agree. Nevertheless, I'm glad to use it because it works. Between
mine_received_headers and using tokens from the URLs' IPs, even pure
word-salad rarely gets through.

Regards,
Matt


