[spambayes-dev] 1070 spam, 1 false positive

Tim Peters tim.one at comcast.net
Tue Jun 24 15:31:29 EDT 2003


>>> basic_header_tokenize: True

>> That's a dangerous one -- although I think you've already figured
>> out why the hard way.

>>> basic_header_skip: received envelope-to delivered-to delivery-date
>>> x-spam-flag x-spam-status content-type list-*

>> The problem is that any random header line can yield a misleading
>> clue by accident, and there may be no end of adding to this list.

[Greg Ward]
> The thing is, every header on that list is there for a very good
> reason.

Sure.  I'm not saying basic_header_skip is dangerous, I'm saying
basic_header_tokenize is dangerous.  That's why it's off by default.
basic_header_skip has no effect unless basic_header_tokenize is forced True.

> But I can see your point: every addition *also* has a very
> good reason for it.  Hmmm.  I guess I should try it without
> basic_header_tokenize at all and see how it does.

In early tests of mine, enabling basic_header_tokenize gave worse results.
It's an experiment (added, IIRC, by Jeremy) which didn't get enough testing
either way to decide.

>>>         'date:2003': 0.663
>>>         'date:Jun': 0.681

>> Any idea where those came from?  They have the form of synthesized
>> tokens (keyword colon stuff), but I don't recall anything in the
>> tokenizer that synthesizes tokens with keyword "date".

I understand these now -- they're a side effect of enabling
basic_header_tokenize.

> Beats me.  In my "default" corpus (right now: 418 ham, 583 spam,
> roughly half of both from June 2003), these tokens are unsurprisingly
> quite common:
>
> >>> h = hammie.open("db/default.db", usedb=True)
> >>> h.bayes.db["date:2003"]
> (283, 192)
> >>> h.bayes.db['date:Jun']
> (317, 193)
>
> So *some* bit of code in there is tokenizing the "Date:" header.

Yes, basic_header_tokenize tokenizes all header lines, except for those
squashed via basic_header_skip.
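
To make that concrete, here is a rough sketch of the idea.  It is not the
actual tokenizer code: the function name basic_header_tokens, the
whitespace splitting, and the glob-style handling of "list-*" are all
assumptions made for illustration.  It only shows how treating every
unskipped header as a bag of words produces tokens like 'date:2003' and
'date:Jun'.

```python
from email import message_from_string
from fnmatch import fnmatchcase

# Skip list copied from the options quoted in this thread.
SKIP_PATTERNS = (
    "received envelope-to delivered-to delivery-date "
    "x-spam-flag x-spam-status content-type list-*"
).split()


def basic_header_tokens(msg, skip_patterns=SKIP_PATTERNS):
    """Yield 'headername:word' tokens for every header not on the skip list.

    Hypothetical helper: the real tokenizer may split header values
    differently and may match skip entries in another way.
    """
    for name, value in msg.items():
        key = name.lower()
        # Assumption: skip entries behave like shell globs ("list-*").
        if any(fnmatchcase(key, pattern) for pattern in skip_patterns):
            continue
        # No header-specific knowledge: just split the value on whitespace.
        for word in value.split():
            yield f"{key}:{word}"


SAMPLE = """\
Received: from mail.example.com by mx.example.org
List-Id: <spambayes-dev.python.org>
Date: Tue, 24 Jun 2003 15:31:29 -0400
Subject: test

body
"""

tokens = list(basic_header_tokens(message_from_string(SAMPLE)))
print(tokens)
```

Nothing here looks at what a header means, so any header that is not on
the skip list can contribute clues, useful or accidental.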

> Seems like a good idea to me, since junk mail often has
> non-RFC-conformant date headers.

basic_header_tokenize doesn't know anything about RFC compliance; it treats
all header lines exactly the same way (as sequences of meaningless
characters).

If you want to experiment with compliance of Date headers, try Skip's
extract_dow option.  In tests, that option did show a weak but highly
significant correlation between different Date times and spam-vs-ham, but
the effect was too small to make any difference to bottom-line results.



