[Spambayes] Mining the headers

Tim Peters tim.one@comcast.net
Sun Oct 27 22:31:36 2002


[Skip Montanaro]
> Done.  Note that I deleted the mine_date_headers option.  It was just a
> gatekeeper for the other two.  Seemed pointless to me.  Here's my latest
> run.  The first run was the default.  My dates.ini file is
>
>     [Tokenizer]
>     generate_time_buckets: True
>     extract_dow: True

Skip, I think there's a bug in the extract_dow code.  On a quick python.org
test, here are the dow tokens left behind in the database:

              #ham  #spam        spamprob
'dow:0'          2      7  0.890542594688
'dow:1'          3      7  0.854937008074
'dow:2'        725     71  0.220827483069
'dow:3'       1038    261  0.420993872704
'dow:4'        845    234  0.444677806501
'dow:5'        126    196  0.81766035841
'dow:6'          0    137  0.998363041106
'dow:invalid' 2741    946  0.499472081328

That test trained on only half a week's traffic, so it's not surprising that
half the days are virtually empty.  What is surprising is that every ham
trained on, and all but 2 of the spam, generated a dow:invalid token.
Because the

                for fmt in self.date_formats:

loop has no early exit, its "else:" clause always executes, so dow:invalid
gets generated even when one of the formats parsed the date.  If I repair
that, dow:invalid becomes a mild spam clue:

'dow:invalid'    2     33  0.97338283678

I say it's "mild" just because it's infrequent in absolute terms.
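
For anyone who hasn't been bitten by for/else before, here is a minimal
sketch of the failure mode in modern Python.  It is not the actual tokenizer
code: DATE_FORMATS, dow_tokens_buggy and dow_tokens_fixed are hypothetical
names, and the two formats are assumptions picked for the illustration.  The
point is only that a "for" loop's "else:" clause runs whenever the loop
finishes without hitting a "break".

```python
import time

# Hypothetical format list; the real tokenizer's date_formats may differ.
DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S",
    "%d %b %Y %H:%M:%S",
)


def dow_tokens_buggy(date_string):
    """No early exit: the else clause runs even after a successful parse."""
    tokens = []
    for fmt in DATE_FORMATS:
        try:
            parsed = time.strptime(date_string, fmt)
        except ValueError:
            continue  # this format didn't match, try the next one
        tokens.append("dow:%d" % parsed.tm_wday)  # Monday is 0
        # Missing "break" here, so the loop always runs to completion.
    else:
        # Reached whenever the loop ends without a break, i.e. always.
        tokens.append("dow:invalid")
    return tokens


def dow_tokens_fixed(date_string):
    """Break on the first successful parse so else means 'nothing matched'."""
    tokens = []
    for fmt in DATE_FORMATS:
        try:
            parsed = time.strptime(date_string, fmt)
        except ValueError:
            continue
        tokens.append("dow:%d" % parsed.tm_wday)
        break  # early exit skips the else clause
    else:
        # Only reached when no format parsed the date.
        tokens.append("dow:invalid")
    return tokens
```

Under those assumptions, dow_tokens_buggy("Sun, 27 Oct 2002 22:31:36")
returns both a dow:N token and dow:invalid, while dow_tokens_fixed returns
only the dow:N token.  A string that matches no format yields just
dow:invalid from either version.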

I'll check that change in anyway, and run a better test.