[Spambayes] tokenizing to: and cc:

Fri Nov 1 20:39:01 2002

[Skip Montanaro]
> I made a change to tokenizer.py (just locally for now) to tokenize the
> domains mentioned in to: and cc: headers:
>
> ...
>           for field in ('to', 'cc'):
>               count = 0
>               for addrs in msg.get_all(field, []):
> !                 addrs = map(email.Utils.parseaddr, addrs.split(','))
> !                 count += len(addrs)
> !                 if options.generate_to_domains:
> !                     for name,addr in addrs:
> !                         yield '%s:@%s' % (field,
> !                                           (addr.split("@")[1:] or
> !                                            ["local"])[0])
>               if count > 0:
>                   yield '%s:2**%d' % (field, round(log2(count)))

> ...
> I think this should be turned into a separate pass over the to: and cc:
> headers to simplify the logic and move the option test out of the inner
> loop.

The time cost is trivial.
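
For reference, a minimal sketch of what such a separate pass might look like. This is not the patch as written: the function name tokenize_recipients and the generate_to_domains keyword are stand-ins for illustration, it uses the modern email.utils spelling, and it swaps the split(',') for email.utils.getaddresses so that commas inside quoted display names do not split an address.

```python
import math
from email import message_from_string
from email.utils import getaddresses


def tokenize_recipients(msg, generate_to_domains=True):
    """Yield recipient tokens for the To: and Cc: headers of msg.

    Hypothetical helper: a standalone pass, so the option test sits
    outside the loop over header instances.
    """
    for field in ('to', 'cc'):
        # getaddresses() parses every instance of the header at once and
        # copes with commas inside quoted display names.
        addrs = getaddresses(msg.get_all(field, []))
        if generate_to_domains:
            for _name, addr in addrs:
                # Same rule as the quoted patch: text after the first "@",
                # or "local" when the address has no "@" at all.
                domain = (addr.split('@')[1:] or ['local'])[0]
                yield '%s:@%s' % (field, domain)
        if addrs:
            # Bucket the recipient count by powers of two.
            yield '%s:2**%d' % (field, round(math.log2(len(addrs))))


if __name__ == '__main__':
    text = 'To: a@x.com, "Doe, J" <b@y.org>\nCc: root\n\nbody\n'
    for token in tokenize_recipients(message_from_string(text)):
        print(token)
```
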

> Here's a summary of the results:
>
>     % python table.py base.txt to.txt
>     -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
>     ...
>     filename:     base      to
>     ham:spam:  2000:2000
>                        2000:2000
>     fp total:        8       7
>     fp %:         0.40    0.35
>     fn total:       21      22
>     fn %:         1.05    1.10
>     unsure t:       95      87
>     unsure %:     2.38    2.17
>     real cost: $120.00 $109.40
>     best cost:  $79.80  $79.60

This says you could have got more benefit simply by changing your ham_cutoff
and spam_cutoff values.  If you had picked the best possible in both cases,
the total difference would have been 1 unsure msg (79.80 - 79.60 = 0.20, the
default "cost" of one unsure).  See your "all runs" histograms for more info
about that.
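
The arithmetic behind those cost lines can be checked with a short sketch. The $0.20 per unsure is the default mentioned above; the $10.00 per false positive and $1.00 per false negative are assumptions here, chosen because they reproduce the "real cost" figures in the quoted table exactly. Amounts are kept in integer cents to avoid floating-point rounding.

```python
# Per-message costs in cents.
FP_COST = 1000     # assumed: $10.00 per false positive
FN_COST = 100      # assumed: $1.00 per false negative
UNSURE_COST = 20   # $0.20 per unsure, the default cited in the text


def cost_cents(fp, fn, unsure):
    """Total cost in cents for the given error and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


# "real cost" row, from the fp/fn/unsure totals in the table.
base_real = cost_cents(fp=8, fn=21, unsure=95)   # 12000 cents = $120.00
to_real = cost_cents(fp=7, fn=22, unsure=87)     # 10940 cents = $109.40

# "best cost" row: $79.80 vs $79.60, expressed in unsure messages.
best_gap = 7980 - 7960                           # 20 cents
unsures_in_gap = best_gap // UNSURE_COST         # 1
```

With the cutoffs tuned in both runs, the whole gap is worth a single unsure message, while the $10.60 gap in "real cost" comes from the fixed cutoffs.
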

>     h mean:       0.79    0.79
>     h sdev:       7.43    7.43
>     s mean:      97.41   97.46
>     s sdev:      12.53   12.53
>     mean diff:   96.62   96.67
>     k:            4.84    4.84

> ...
> All things considered, I think it did pretty well for me.  It dropped the
> unsure percentage a bit

Changing cutoffs can do the same.

> and spread the ham and spam means a bit further apart.

A change of 0.05 relative to 96.62 is insignificant.

> As I mentioned earlier, I think this option may be useful
> for people with inactive, but still operational, email addresses.
> Over time, those addresses will tend to get nothing but spam.  (It
> would thus be important to not train on messages sent to those
> addresses before or shortly after abandonment.)
>
> Should I rework the patch and check it in?

I'm -0, but would become +1 if it really helped someone, or nailed cases
that can't be nailed via a more general gimmick.

For example, what if you were to introduce an option to fully tokenize To:
and Cc: addresses instead?  We don't even catch "Undisclosed Recipients"
now.  We ignore addressees by default because it becomes a killer-strong
clue for bogus reasons when training with mixed-source corpora (e.g.,

    To: bruceg@whatever

is in thousands & thousands of BruceG's spams).
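
A rough sketch of what "fully tokenize" could mean, offered only as one possible reading: split the raw header text into lowercase pieces and tag each with the header name. The function name tokenize_addressees and the splitting rule are hypothetical, not anything in tokenizer.py.

```python
import re
from email import message_from_string

# Hypothetical splitting rule: runs of characters that are not
# whitespace or address punctuation.
_PIECE = re.compile(r'[^\s,;:<>"()]+')


def tokenize_addressees(msg, fields=('to', 'cc')):
    """Yield one token per piece of each To:/Cc: header value."""
    for field in fields:
        for value in msg.get_all(field, []):
            for piece in _PIECE.findall(value.lower()):
                yield '%s:%s' % (field, piece)


if __name__ == '__main__':
    msg = message_from_string('To: Undisclosed Recipients:;\n\nbody\n')
    print(list(tokenize_addressees(msg)))
```

Note that this would also produce a to:bruceg@whatever token, which is exactly the bogus killer clue described above, so it only makes sense as an option that stays off for mixed-source corpora.
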