[Spambayes] Tony Meyer - Training question
Tony Meyer
tameyer at ihug.co.nz
Sun Sep 18 10:58:12 CEST 2005
> IMHO, I would tell your users to replace the
> default_bayes_customize.ini
> with my settings below *smile*, as I found that more tokens the
> better for
> my mail stream.
This certainly falls outside the realm of simple, however. I'm also
not convinced that all of these options are good for everyone (or
they would default to 'on').
Enabling all the options is also not something I'd recommend. As
part of the 2005 TREC spam track, one of the submitted SpamBayes runs
enabled all boolean options (except the slurping ones).
Results from TREC aren't complete yet, but initial testing indicates
that this run performs worse than running with defaults.
[...]
> [Classifier]
>
> x-use_bigrams: True
*Most* tests have indicated that the windowing bi-gram scheme is
better than straight unigrams. There have been a few cases where
this was not true, however. It does also vastly increase the
database size. It's certainly a good technique (and in 1.1 is no
longer experimental), and if people are going to go to the effort of
customizing the tokenization/classification options, this would be
the best choice.
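To illustrate why the database grows, here is a minimal sketch of
emitting bigrams alongside unigrams. It is not the SpamBayes
tokenizer: the function name and the "bi:" prefix are assumptions made
for illustration, and the classification-time windowing step (deciding
which of the overlapping unigram/bigram clues to score) is not shown.

```python
def unigrams_and_bigrams(tokens):
    """Yield each token, plus a 'bi:'-prefixed entry for each adjacent pair.

    Hypothetical helper for illustration; not the SpamBayes implementation.
    """
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            yield "bi:%s %s" % (previous, token)
        previous = token


words = "cheap meds shipped overnight".split()
features = list(unigrams_and_bigrams(words))
# n tokens give n unigrams plus n - 1 bigrams.
print(len(words), len(features))  # 4 7
```

Each message only roughly doubles its token count, but distinct word
pairs are far more numerous than distinct words, so the number of
entries stored across a whole training set grows much faster.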
> max_discriminators: 150
This is already the default, so isn't needed (it has no effect).
> replace_nonascii_chars: True
> record_header_absence: True
These two are specifically enabled in the Outlook plug-in by default
(so they do need to be in the file if it replaces the default one).
replace_nonascii_chars is probably a bad idea for anyone who
receives non-English ham, and probably a good idea for anyone else.
record_header_absence has also had mixed results.
> x-fancy_url_recognition: True
> x-pick_apart_urls: True
These are experimental options; as such they haven't had the
extensive testing that other options have. It's not clear yet
whether these are a good idea for any user or not.
> x-reduce_habeas_headers: True
> x-search_for_habeas_headers: True
It's pretty clear that Habeas's headers are a failed experiment.
These options probably aren't worth including, and are likely to be
removed in a future release.
> basic_header_tokenize: True
> basic_header_skip: date x-.* domainkey-signature
Testing hasn't shown that basic_header_tokenize is a good idea. Is
there a reason you turned it on?
> octet_prefix_size: 5
This is the default; it will have no effect.
> mine_received_headers: True
As long as the training data is from the user, this should help.
> address_headers: from sender reply-to errors-to
I don't have any testing to hand about this, but I doubt that
removing "to" and "cc" from the headers that are tokenized is a good
idea. For me, at least, the data in the "to" and "cc" headers is
definitely a good indicator of whether the message is ham/spam; I
would expect this would be the case for many people. Adding
errors-to might help; I don't know if any testing has been done on that.
> generate_long_skips: True
This is the default; it will have no effect.
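Lines that only restate a default can be spotted mechanically. The
following is a rough sketch, assuming the file parses with Python's
standard configparser; the KNOWN_DEFAULTS table, the sample text and
the function name are made up for illustration and contain only the
defaults mentioned so far in this message. The sample mirrors the
quoted fragment as it appears here, and the check ignores section
names.

```python
import configparser

# Defaults taken only from the remarks above; a real check would read them
# from the installed SpamBayes version rather than hard-coding them.
KNOWN_DEFAULTS = {
    "max_discriminators": "150",
    "octet_prefix_size": "5",
    "generate_long_skips": "True",
}

SAMPLE = """\
[Classifier]
x-use_bigrams: True
max_discriminators: 150
octet_prefix_size: 5
generate_long_skips: True
skip_max_word_size: 50
"""


def redundant_options(ini_text, defaults=KNOWN_DEFAULTS):
    """Return sorted option names whose value equals the known default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    found = set()
    for section in parser.sections():
        for name, value in parser.items(section):
            if defaults.get(name) == value.strip():
                found.add(name)
    return sorted(found)


print(redundant_options(SAMPLE))
# ['generate_long_skips', 'max_discriminators', 'octet_prefix_size']
```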
> skip_max_word_size: 50
I believe that (in the early days) there was a lot of testing to
determine what the best minimum and maximum token sizes were. 50 is
a *lot* bigger than the default 12 - do you really have many strong
tokens longer than 12?
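For anyone unsure what the word-size limit changes, here is a rough
sketch of the idea: words up to the limit are kept as tokens, and
longer ones are replaced by a coarse "skip" summary. The function
name, the lower bound of 3 and the exact shape of the skip token are
assumptions for illustration, not a copy of the SpamBayes tokenizer.

```python
def tokenize_word(word, max_word_size=12, generate_long_skips=True):
    """Return the tokens produced for one word. Illustrative sketch only."""
    n = len(word)
    if n < 3:
        return []  # assumed lower bound: very short words are dropped
    if n <= max_word_size:
        return [word]
    if generate_long_skips:
        # Summarize the long word: first character plus its length
        # rounded down to a multiple of 10.
        return ["skip:%s %d" % (word[0], n // 10 * 10)]
    return []


word = "pharmaceuticals"  # 15 characters
print(tokenize_word(word, max_word_size=12))  # ['skip:p 10']
print(tokenize_word(word, max_word_size=50))  # ['pharmaceuticals']
```

With a limit of 50, nearly every distinct long string becomes its own
token instead of sharing a skip token, which only helps if those long
strings recur and are strong clues.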
> [URLRetriever]
>
> x-cache_directory: url-cache
> x-cache_expiry_days: 31
> x-only_slurp_base: True
> x-slurp_urls: True
> x-web_prefix:web:
I would not recommend enabling these without understanding what they
do. The main issue is that as a result of enabling them, SpamBayes
will be downloading a lot of extra material - for those where
connection speed or bandwidth are issues, this might not be a good
step. It's also not at all clear that they are beneficial - without
the only_slurp_base option, testing generally indicates good results,
but that means that any web 'bugs' in the message will be triggered.
With the only_slurp_base option, results are mixed, leaning towards
negative.
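As a sketch of the trade-off, assuming "base" means roughly the scheme
and host of a URL (the exact reduction SpamBayes performs is not
stated here) and that the 'bugs' are per-recipient tracking URLs:
fetching only the reduced form never requests the unique path or
query, but it also retrieves less specific content. The function name
and example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def base_url(url):
    """Reduce a URL to scheme and host, dropping path, query and fragment.

    Hypothetical helper; not the SpamBayes implementation.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


print(base_url("http://example.com/track/open.gif?id=abc123"))
# http://example.com/
```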
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.