[Spambayes] Tony Meyer - Training question

Sun Sep 18 10:58:12 CEST 2005

> IMHO, I would tell your users to replace the  
> default_bayes_customize.ini
> with my settings below *smile*, as I found that more tokens the  
> better for
> my mail stream.

This certainly falls outside the realm of simple, however.  I'm also  
not convinced that all of these options are good for everyone (or  
they would default to 'on').

Enabling all the options is also not something I'd recommend.  As  
part of the 2005 TREC spam track, one of the SpamBayes runs submitted  
included enabling all boolean options (except the slurping ones).   
Results from TREC aren't complete yet, but initial testing indicates  
that this run performs worse than running with defaults.

[...]
> [Classifier]
>
> x-use_bigrams: True

*Most* tests have indicated that the windowing bi-gram scheme is  
better than straight unigrams.  There have been a few cases where  
this was not true, however.  It does also vastly increase the  
database size.  It's certainly a good technique (and in 1.1 is no  
longer experimental), and if people are going to go to the effort of  
customizing the tokenization/classification options, this would be  
the best choice.
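
To give a rough feel for why the database grows, here is a minimal sketch of adding bigrams to a unigram token stream.  It is an illustration only, not the SpamBayes code: the function name and the "bi:" prefix are assumptions made for the example, and the step where the classifier chooses among overlapping unigrams and bigrams at scoring time is not shown.

```python
def with_bigrams(tokens):
    """Yield every token, plus one bigram for each adjacent pair of tokens."""
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            # Hypothetical "bi:" prefix keeps bigrams distinct from unigrams.
            yield "bi:%s %s" % (previous, token)
        previous = token


words = ["cheap", "meds", "online", "now"]
tokens = list(with_bigrams(words))
print(len(words), "words ->", len(tokens), "tokens")
```

Each message contributes nearly twice as many distinct tokens, and most bigrams are rarer than the words that make them up, so the number of database entries climbs quickly.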

> max_discriminators: 150

This is already the default, so isn't needed (it has no effect).

> replace_nonascii_chars: True
> record_header_absence: True

These two are already enabled by default in the Outlook plug-in
(although they do need to be included in the file if it's replaced).
replace_nonascii_chars is probably a bad idea for anyone who
receives non-English ham, and probably a good idea for anyone else.
record_header_absence has also had mixed results.

> x-fancy_url_recognition: True
> x-pick_apart_urls: True

These are experimental options; as such they haven't had the  
extensive testing that other options have.  It's not clear yet  
whether these are a good idea for any user or not.

> x-reduce_habeas_headers: True
> x-search_for_habeas_headers: True

It's pretty clear that Habeas's headers are a failed experiment.
These options probably aren't worth including, and are likely to be  
removed in a future release.

> basic_header_tokenize: True
> basic_header_skip: date x-.* domainkey-signature

Testing hasn't shown that basic_header_tokenize is a good idea.  Is  
there a reason you turned it on?

> octet_prefix_size: 5

This is the default; it will have no effect.

> mine_received_headers: True

As long as the training data is from the user, this should help.

> address_headers: from sender reply-to errors-to

I don't have any testing to hand about this, but I doubt that  
removing "to" and "cc" from the headers that are tokenized is a good  
idea.  For me, at least, the data in the "to" and "cc" headers is  
definitely a good indicator of whether the message is ham/spam; I  
would expect this would be the case for many people.  Adding
errors-to might help; I don't know if any testing has been done on that.

> generate_long_skips: True

This is the default; it will have no effect.
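
Settings that only restate a default can be spotted mechanically.  The following is a minimal sketch using Python's standard configparser module; the function name is made up for the example, the defaults table just repeats the values mentioned in this message (a real check should read them from the installed SpamBayes version), and the sample text is a cut-down version of the quoted settings.

```python
import configparser

# Defaults as stated in this message - an assumption for the example, not
# read from SpamBayes itself.
STATED_DEFAULTS = {
    "max_discriminators": "150",
    "octet_prefix_size": "5",
    "generate_long_skips": "true",
}


def redundant_options(ini_text):
    """Return (section, option) pairs whose value equals the stated default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    found = []
    for section in parser.sections():
        # configparser lower-cases option names by default.
        for name, value in parser.items(section):
            default = STATED_DEFAULTS.get(name)
            if default is not None and value.strip().lower() == default:
                found.append((section, name))
    return found


sample = """\
[Classifier]
x-use_bigrams: True
max_discriminators: 150
"""
print(redundant_options(sample))
```

Dropping the lines it reports leaves a shorter file in which everything that remains actually changes behaviour.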

> skip_max_word_size: 50

I believe that (in the early days) there was a lot of testing to  
determine what the best minimum and maximum token sizes were.  50 is  
a *lot* larger than the default 12 - do you really have many strong  
tokens longer than 12?
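
To make the trade-off concrete, here is a simplified sketch of how a maximum word size interacts with generate_long_skips.  It is loosely modelled on the idea rather than copied from the tokenizer: the function signature, the minimum length of 3 and the exact "skip:" token format are assumptions made for the example.

```python
def tokenize_word(word, max_word_size=12, generate_long_skips=True):
    """Return the tokens produced for a single word (simplified)."""
    n = len(word)
    if 3 <= n <= max_word_size:
        # Short enough: the word itself becomes the token.
        return [word]
    if n > max_word_size and generate_long_skips:
        # Too long: record only the first character and the rough length.
        return ["skip:%s %d" % (word[0], n // 10 * 10)]
    # Very short words (and long words when skips are off) produce nothing.
    return []


word = "unsubscribe-instantly"
print(tokenize_word(word))                    # summarised as a skip token
print(tokenize_word(word, max_word_size=50))  # kept as its own token
```

With the larger limit every long string becomes its own token, so the setting only pays off if those long tokens recur often enough to be strong clues; otherwise they mostly just add database entries.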

> [URLRetriever]
>
> x-cache_directory: url-cache
> x-cache_expiry_days: 31
> x-only_slurp_base: True
> x-slurp_urls: True
> x-web_prefix:web:

I would not recommend enabling these without understanding what they  
do.  The main issue is that as a result of enabling them, SpamBayes  
will be downloading a lot of extra material - for anyone whose
connection speed or bandwidth is limited, this might not be a good
step.  It's also not at all clear that they are beneficial - without
the only_slurp_base option, testing generally indicates good results,
but fetching the full URLs means that any web 'bugs' in the message
will be triggered.  With the
only_slurp_base option, results are mixed, leaning towards negative.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.