[Spambayes] to From_ or not to From_?

Tue, 01 Oct 2002 19:35:14 -0400

[Neale Pickett, "From " lines]
> ...
> Here, I'll put my money where my mouth is.  My mail program writes the
> From header as an X-From: line.  I add this to my bayescustomize.ini:
>
> [Tokenizer]
> basic_header_tokenize: True
> basic_header_skip: received
>     date
>     x-[^f][^r].*

Note that this tokenizes a great many header lines beyond just x-from.
Something like

    basic_header_skip: (?!x-from)

would have been sharper (that's a negative lookahead assertion:  it matches
iff the header name doesn't match x-from, so it skips a header line iff it's
not x-from, so it looks only at x-from -- all obvious to the most casual
observer <wink>).
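
For the less casual observer, here's a minimal sketch of the difference between the two settings. It assumes each basic_header_skip entry is a regexp applied with re.match to the lowercased header name, and that a header is tokenized only if no entry matches. The helper tokenized_headers and the sample header names are made up for illustration; they are not part of the spambayes code.

```python
import re

# Skip patterns from the quoted bayescustomize.ini.
NEALE_SKIP = [r"received", r"date", r"x-[^f][^r].*"]
# The sharper alternative: one negative lookahead assertion.
SHARP_SKIP = [r"(?!x-from)"]

def tokenized_headers(names, skip_patterns):
    """Return the header names that survive the skip list.

    Hypothetical helper: assumes a header is skipped when any pattern
    matches (via re.match) at the start of its lowercased name.
    """
    compiled = [re.compile(p) for p in skip_patterns]
    return [n for n in names
            if not any(rx.match(n.lower()) for rx in compiled)]

# Sample header names, chosen only to exercise the patterns.
HEADERS = ["Received", "Date", "Subject", "To",
           "X-From", "X-Mailer", "X-Face", "X-Priority"]

print(tokenized_headers(HEADERS, NEALE_SKIP))
# -> ['Subject', 'To', 'X-From', 'X-Face', 'X-Priority']
print(tokenized_headers(HEADERS, SHARP_SKIP))
# -> ['X-From']
```

Under those assumptions, x-[^f][^r].* skips only those x- headers whose name has neither an "f" right after the dash nor an "r" one character later, and it doesn't touch non-x- headers at all. The lookahead skips everything whose name doesn't start with x-from.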

> And I get this on my tiny corpus (2x5x200 messages):
>
> """
> false positive percentages
>     1.500  1.500  tied
>     1.000  1.000  tied
>     2.000  1.000  won    -50.00%
>     1.500  1.000  won    -33.33%
>     1.500  1.000  won    -33.33%
>
> won   3 times
> tied  2 times
> lost  0 times
>
> total unique fp went from 15 to 11 won    -26.67%
> mean fp % went from 1.5 to 1.1 won    -26.67%
>
> false negative percentages
>     1.500  1.000  won    -33.33%
>     0.000  0.500  lost  +(was 0)
>     1.000  1.000  tied
>     0.500  0.000  won   -100.00%
>     1.000  1.000  tied
>
> won   2 times
> tied  2 times
> lost  1 times
> """
>
> In all but one case where something changed, it was just a single
> message.  That's not a huge improvement,

*Relative to* your error rates, it was a huge improvement, but it's hard to
be confident about it because the absolute # of msgs involved is so small.
Still, that it won 3 times on f-p, and never lost, adds to the confidence
you should have that it truly helped.
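
To make the relative-vs-absolute point concrete, here's a rough sketch that turns the quoted f-p percentages back into message counts. It assumes "2x5x200" means 200 ham per run, which is consistent with the quoted total of 15 f-p before the change; the variable names are just for illustration.

```python
# False positive percentages per run, copied from the quoted output.
BEFORE = [1.5, 1.0, 2.0, 1.5, 1.5]
AFTER = [1.5, 1.0, 1.0, 1.0, 1.0]
HAM_PER_RUN = 200  # assumption: 200 ham scored in each of the 5 runs

def to_counts(percentages, n):
    """Convert per-run error percentages into whole message counts."""
    return [round(p * n / 100) for p in percentages]

before_n = to_counts(BEFORE, HAM_PER_RUN)  # [3, 2, 4, 3, 3]
after_n = to_counts(AFTER, HAM_PER_RUN)    # [3, 2, 2, 2, 2]

absolute_change = sum(after_n) - sum(before_n)
relative_change = 100.0 * absolute_change / sum(before_n)

print("f-p messages: %d -> %d" % (sum(before_n), sum(after_n)))
print("absolute change: %d messages" % absolute_change)
print("relative change: %.2f%%" % relative_change)
```

A drop of more than a quarter is carried by just 4 messages out of 1000 ham, which is why the relative number looks big while the evidence behind it is thin.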

> but maybe enough of one to convince someone with a larger test
> set to try it out?

I can't get away with tokenizing so many header lines; there are too many
"good clues for bad reasons" in my mixed-source data.