[Spambayes] Moving closer to Gary's ideal

Tim Peters tim.one@comcast.net
Mon, 23 Sep 2002 02:07:34 -0400


[Guido]
> ...
> One of the brief questions used charset=GB2312 (whatever that is);

One of many Chinese character sets.

> ...
> Suggestion: rather than showing the content of the fn's and fp's (the
> filenames are enough for me), would it be possible to show the
> filenames corresponding to the outliers in the ham/spam distributions?

[TestDriver]
show_charlimit: 0

in conjunction with show_false_positives and/or show_false_negatives and/or
the following.

> E.g. there's 1 message in my spam collection that scores 37.50
> according to the overall histogram.  How to find that one?

As above, or

[TestDriver]
show_spam_lo: .375
show_spam_hi: .4   (or whatever your next bucket boundary is)
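To see why .375/.4 is the right pair here, think of the histogram as evenly
spaced buckets over [0, 1].  A minimal sketch of how a score maps to its
bracketing bucket boundaries; the 40-bucket count below is an assumption
(check your own nbuckets setting), and the function is illustrative, not
code from the test driver:

```python
# Hypothetical helper: find the bucket boundaries that bracket a score,
# assuming nbuckets evenly spaced buckets over [0, 1].
def bracketing_boundaries(score, nbuckets=40):
    idx = int(score * nbuckets)       # which bucket the score falls in
    return idx / nbuckets, (idx + 1) / nbuckets

print(bracketing_boundaries(0.375))   # (0.375, 0.4)
```

With 40 buckets each is 0.025 wide, so a message scoring 37.50 (i.e. 0.375)
lands in the [0.375, 0.4) bucket, and those are the values to feed to
show_spam_lo and show_spam_hi.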


>> BTW, it's my belief that this all works *best* if the ratio of ham
>> to spam trained on matches your real-life inbox ratio.

> That's impossible to know in my case.

By "your inbox" I mean the one you see.

> Almost all of my mail goes through the SpamAssassin setup at
> python.org, which throws all spam away.  As a result I see maybe
> 1 spam for every 50 hams -- but that's not the spam/ham ratio
> seen by the MTA for guido@python.org.

Training for your MTA is different from training for the inbox you see.  If
your interest is to help python.org do a better job, then you need to get
the spam that python.org currently rejects; if your interest is in the spam
you see despite python.org, then you should train on the email you actually
get.

>>> I did notice that many fp's were very spammish automated postings
>>> that I have specifically signed up for, like our building's
>>> announcements, product newsletters, and so on.  I haven't looked at
>>> the fn's.

>> I expect these are your moral equivalents to the conference
>> announcements in my c.l.py ham, except worse.  However, I expect you
>> have more cause for optimism about those: you (like me) are running
>> a crippled version of the algorithm because of your mixed-source
>> corpora.  The headers we're ignoring are bound to have strong clues
>> about the *senders* of the spammish stuff you've signed up for.

> Only if I saved enough of these, right?

Right.  I don't remember which scheme you're using.  Under the all-default
Graham scheme, a single instance of a word in a ham or spam training set
gives that word maximum probability strength.  In the all-default Robinson
scheme, it does not, and the "a" parameter still needs tuning.
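The difference between the two schemes on a once-seen word can be sketched
in a few lines.  This is an illustration of the contrast described above,
not Spambayes' actual code: the Graham side is a stripped-down version
(the real scheme also weights ham counts), and the Robinson side uses
Gary Robinson's f(w) = (a*x + n*p) / (a + n) smoothing, with "a" the
tunable strength parameter and x a neutral prior:

```python
# Hedged sketch, not the Spambayes implementation.
def graham_prob(spam_count, ham_count):
    # simplified Graham: raw ratio, clipped to [0.01, 0.99]
    total = spam_count + ham_count
    if total == 0:
        return 0.5
    return min(max(spam_count / total, 0.01), 0.99)

def robinson_prob(spam_count, ham_count, a=1.0, x=0.5):
    # Robinson smoothing: pull low-evidence words toward the prior x;
    # "a" controls how much evidence is needed to move away from it
    n = spam_count + ham_count
    p = 0.5 if n == 0 else spam_count / n
    return (a * x + n * p) / (a + n)

# a word seen once in spam, never in ham:
print(graham_prob(1, 0))    # 0.99 -- maximum strength immediately
print(robinson_prob(1, 0))  # 0.75 -- pulled toward the 0.5 prior
```

So under Graham a single spam occurrence is already a maximally strong
clue, while under Robinson it takes repeated evidence to overcome the
prior, which is why the "a" parameter matters and still needs tuning.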

> Any clue as to what option to try?

Have you read the "defaults" string in Options.py?  That's an embedded .ini
file explaining every option in its comments, and supplying their default
values.  mine_received_headers has been powerful for those who can use it; I
doubt that you can (I cannot; it picks up way too many clues about BruceG).
We've gotten mixed reports on count_all_header_lines (it helps or it
doesn't; I don't recall anyone saying it hurt); I can't use that either, but
that's mostly because my ham is full of Mailman list headers that appear
almost nowhere in my spam; you may have better luck with that.  Jeremy added
some options:

    basic_header_tokenize
    basic_header_skip

that are also disabled by default.  I haven't tried them.  I've been burned
so often by trying new headers that I'm not going to spend more time on that
until I have a single-source corpus.
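For the curious, the general idea behind header tokenizing can be sketched
briefly.  The option names above are from Options.py, but the code below is
purely illustrative, not Spambayes' tokenizer; the skip list stands in for
what basic_header_skip would hold:

```python
# Hypothetical sketch of basic header tokenization, not Spambayes' code.
import re

def tokenize_headers(headers, skip=("received",)):
    """Yield header-tagged word tokens, skipping named headers."""
    for name, value in headers:
        if name.lower() in skip:
            continue
        for word in re.findall(r"[\w.%+-]+", value):
            # prefix with the header name so "free" in Subject and
            # "free" in X-Mailer remain distinct clues
            yield f"{name.lower()}:{word.lower()}"

print(list(tokenize_headers([("Subject", "Cheap Meds"),
                             ("Received", "from mx.example.com")])))
# ['subject:cheap', 'subject:meds']
```

Tagging tokens with the header name is what lets a classifier learn
sender-specific clues from the headers, which is the potential win (and,
with mixed-source corpora, the potential trap) discussed above.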