[Spambayes] Mining the headers
Tim Peters
tim.one@comcast.net
Sun Oct 27 22:31:36 2002
[Skip Montanaro]
> Done. Note that I deleted the mine_date_headers option. It was just a
> gatekeeper for the other two. Seemed pointless to me. Here's my latest
> run. The first run was the default. My dates.ini file is
>
> [Tokenizer]
> generate_time_buckets: True
> extract_dow: True
Skip, I think there's a bug in the extract_dow code. On a quick python.org
test, here are the dow tokens left behind in the database:
                #ham  #spam  spamprob
'dow:0'            2      7  0.890542594688
'dow:1'            3      7  0.854937008074
'dow:2'          725     71  0.220827483069
'dow:3'         1038    261  0.420993872704
'dow:4'          845    234  0.444677806501
'dow:5'          126    196  0.81766035841
'dow:6'            0    137  0.998363041106
'dow:invalid'   2741    946  0.499472081328
That was trained on only half a week's traffic, so it's not surprising that
half the days are virtually empty. What is surprising is that every ham
trained on, and all but 2 of the spam, generated a dow:invalid token.
Because the
    for fmt in self.date_formats:
loop has no early exit, its "else:" clause always executes. If I repair
that, dow:invalid becomes a mild spam clue:
'dow:invalid'      2     33  0.97338283678
I say it's "mild" just because it's infrequent in absolute terms.
I'll check that change in anyway, and run a better test.
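
For anyone who hasn't been bitten by this before: a for loop's "else:" clause runs whenever the loop finishes without hitting a "break", so a loop that never breaks always runs it. Here's a minimal sketch of the failure and the repair. This is not the actual tokenizer code; the function names, the DATE_FORMATS tuple and the use of time.strptime are assumptions made up for illustration.

```python
import time

# Hypothetical format list; the tokenizer's real self.date_formats may differ.
DATE_FORMATS = ("%a, %d %b %Y %H:%M:%S", "%d %b %Y %H:%M:%S")

def dow_tokens_buggy(date_string):
    """Illustrative only: no early exit, so the else clause always runs."""
    tokens = []
    for fmt in DATE_FORMATS:
        try:
            parsed = time.strptime(date_string, fmt)
        except ValueError:
            continue                    # this format didn't match; try the next
        tokens.append("dow:%d" % parsed.tm_wday)
        # No break here, so the loop always runs to completion ...
    else:
        # ... and a loop that completes without break executes its else.
        tokens.append("dow:invalid")
    return tokens

def dow_tokens_fixed(date_string):
    """Illustrative only: break on the first format that parses."""
    tokens = []
    for fmt in DATE_FORMATS:
        try:
            parsed = time.strptime(date_string, fmt)
        except ValueError:
            continue
        tokens.append("dow:%d" % parsed.tm_wday)
        break                           # early exit skips the else clause
    else:
        # Reached only when no format parsed the date.
        tokens.append("dow:invalid")
    return tokens

if __name__ == "__main__":
    good = "Sun, 27 Oct 2002 22:31:36"
    print(dow_tokens_buggy(good))
    print(dow_tokens_fixed(good))
    print(dow_tokens_fixed("not a date"))
```

With the break in place, a message gets dow:invalid only when none of the formats can parse its date, which is the behavior the token's name promises.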