[Spambayes] tokenizing to: and cc:
Tim Peters
tim.one@comcast.net
Fri Nov 1 20:39:01 2002
[Skip Montanaro]
> I made a change to tokenizer.py (just locally for now) to tokenize the
> domains mentioned in to: and cc: headers:
>
> ...
>         for field in ('to', 'cc'):
>             count = 0
>             for addrs in msg.get_all(field, []):
> !               addrs = map(email.Utils.parseaddr, addrs.split(','))
> !               count += len(addrs)
> !               if options.generate_to_domains:
> !                   for name,addr in addrs:
> !                       yield '%s:@%s' % (field,
> !                                         (addr.split("@")[1:] or
> !                                          ["local"])[0])
>             if count > 0:
>                 yield '%s:2**%d' % (field, round(log2(count)))
> ...
> I think this should be turned into a separate pass over the to: and cc:
> headers to simplify the logic and move the option test out of the inner
> loop.
The time cost is trivial.
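
For reference, the separate pass Skip describes might look roughly like the sketch below. It is an illustration only, not the code in tokenizer.py: the function name tokenize_recipients and the generate_to_domains keyword argument are hypothetical stand-ins for the tokenizer method and the option in the patch, and it uses email.utils.getaddresses from current Python rather than the split(',') plus parseaddr combination quoted above.

```python
import email
from email.utils import getaddresses
from math import log2


def tokenize_recipients(msg, generate_to_domains=True):
    """Yield recipient-count tokens, then (optionally) per-domain tokens.

    Hypothetical helper: the name and the keyword argument are
    assumptions made for this sketch, not part of the original patch.
    """
    # Pass 1: one count token per header field, as in the existing code.
    for field in ('to', 'cc'):
        # getaddresses copes with commas inside quoted display names,
        # which a plain split(',') does not.
        addrs = getaddresses(msg.get_all(field, []))
        if addrs:
            yield '%s:2**%d' % (field, round(log2(len(addrs))))

    # Pass 2: domain tokens; the option is tested once, outside the loops.
    if generate_to_domains:
        for field in ('to', 'cc'):
            for name, addr in getaddresses(msg.get_all(field, [])):
                # An address without "@" gets the placeholder domain "local".
                domain = (addr.split('@')[1:] or ['local'])[0]
                yield '%s:@%s' % (field, domain)


if __name__ == '__main__':
    msg = email.message_from_string(
        'To: a@example.com, "Doe, J" <b@example.org>\n'
        'Cc: root\n'
        '\n'
        'body\n')
    for token in tokenize_recipients(msg):
        print(token)
```

The count tokens now come out before the domain tokens, which should not matter as long as the classifier treats the tokens as an unordered collection.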
> Here's a summary of the results:
>
> % python table.py base.txt to.txt
> -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
> ...
> filename:       base        to
> ham:spam:  2000:2000 2000:2000
> fp total:          8         7
> fp %:           0.40      0.35
> fn total:         21        22
> fn %:           1.05      1.10
> unsure t:         95        87
> unsure %:       2.38      2.17
> real cost:   $120.00   $109.40
> best cost:    $79.80    $79.60
This says you could have got more benefit simply by changing your ham_cutoff
and spam_cutoff values. If you had picked the best possible in both cases,
the total difference would have been 1 unsure msg (79.80 - 79.60 = 0.20, the
default "cost" of one unsure). See your "all runs" histograms for more info
about that.
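
The arithmetic behind the two cost rows can be checked with a few lines of Python. Only the $0.20 cost of an unsure is stated above; the $10.00 per false positive and $1.00 per false negative used here are assumptions, chosen because they reproduce the "real cost" row of the table exactly.

```python
# Per-message costs. UNSURE_COST is the default mentioned above; FP_COST
# and FN_COST are assumed values that happen to reproduce the table.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20


def total_cost(fp, fn, unsure):
    """Total cost of a run from its error and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


base_cost = total_cost(fp=8, fn=21, unsure=95)
to_cost = total_cost(fp=7, fn=22, unsure=87)
print('real cost: $%.2f $%.2f' % (base_cost, to_cost))

# With the best cutoffs for each run, the gap is worth about one unsure.
best_gap = 79.80 - 79.60
print('best-cost gap in unsures: %d' % round(best_gap / UNSURE_COST))
```

Under those weights, most of the $10.60 drop in real cost comes from the single false positive that went away, and a count that small can move just as easily when the cutoffs change.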
> h mean:         0.79      0.79
> h sdev:         7.43      7.43
> s mean:        97.41     97.46
> s sdev:        12.53     12.53
> mean diff:     96.62     96.67
> k:              4.84      4.84
> ...
> All things considered, I think it did pretty well for me. It dropped the
> unsure percentage a bit
Changing cutoffs can do the same.
> and spread the ham and spam means a bit further apart.
A change of 0.05 relative to 96.62 is insignificant.
> As I mentioned earlier, I think this option may be useful
> for people with inactive, but still operational, email addresses.
> Over time, those addresses will tend to get nothing but spam. (It
> would thus be important to not train on messages sent to those
> addresses before or shortly after during abandonment.)
>
> Should I rework the patch and check it in?
I'm -0, but would become +1 if it really helped someone, or nailed cases
that can't be nailed via a more general gimmick.
For example, what if you were to introduce an option to fully tokenize To:
and Cc: addresses instead? We don't even catch "Undisclosed Recipients"
now. We ignore addressees by default because it becomes a killer-strong
clue for bogus reasons when training with mixed-source corpora (e.g.,
To: bruceg@whatever
is in thousands & thousands of BruceG's spams).
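
A rough sketch of what such an option could generate follows. Everything in it is hypothetical: the function name tokenize_addressees, the word pattern, and the token format are made up for illustration, and none of it is what tokenizer.py does today. It splits the raw header text into lowercase words, so a non-address value like "Undisclosed Recipients" still produces tokens.

```python
import email
import re

# Runs of word characters plus ".", "@" and "-", so that a whole address
# such as bruceg@whatever survives as a single token.
WORD_RE = re.compile(r'[\w.@-]+')


def tokenize_addressees(msg, fields=('to', 'cc')):
    """Yield one token per word found in the raw To: and Cc: headers.

    Hypothetical sketch of a "fully tokenize addressees" option; it
    would need to stay off by default for the mixed-source reason above.
    """
    for field in fields:
        for value in msg.get_all(field, []):
            for word in WORD_RE.findall(value.lower()):
                yield '%s:%s' % (field, word)


if __name__ == '__main__':
    msg = email.message_from_string(
        'To: Undisclosed Recipients:;\n'
        'Cc: Bruce G <bruceg@whatever>\n'
        '\n'
        'body\n')
    for token in tokenize_addressees(msg):
        print(token)
```

Working from the raw header text instead of parsed address pairs is a deliberate shortcut here: it avoids depending on how an address parser treats group syntax like "Undisclosed Recipients:;".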