[Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.14,1.15

Tim Peters tim.one@comcast.net
Wed, 28 Aug 2002 00:15:20 -0400


[Skip]
> Modified Files:
> 	GBayes.py
> Log Message:
> ehh - it actually didn't work all that well.  the spurious report that it
> did well was pilot error.

That's why it's taken me so long to report anything <0.5 wink> -- it's mondo
tedious work to get trustworthy results.

> besides, tim's report suggests that a simple str.split() may be the
> best tokenizer anyway.

That's merely the first one I tried after I *finally* got the corpora into
usable shape.  Its chief virtue in my eyes was speed; I'm pleasantly
surprised at how well it did, despite ignoring headers, and especially
despite the fact that we appear to have trained it with plenty of spam
claiming to be ham.
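
For concreteness, the whitespace splitter amounts to nothing fancier than
this sketch -- the real hookup in GBayes.py may look a little different:

def tokenize_split(text):
    # Yield whitespace-separated "words" exactly as str.split() finds them:
    # no case folding, no punctuation stripping.
    for word in text.split():
        yield word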

Since the ham archive proved to be tainted, the next thing is to clean it
up.  Then I'll try lots of tokenization gimmicks.

I'm doing a quick run on character 3-grams now for the heck of it:

def tokenize_3gram(string):
    # Slide a 3-character window over the text, yielding every
    # overlapping 3-character slice.
    for i in xrange(len(string)-2):
        yield string[i : i+3]

applied to the message body as one giant string without any transformations
(so newlines are in there, and blanks, and mixed case, etc).
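
Just to show what that yields, on a made-up fragment (note that blanks and
mixed case survive):

>>> list(tokenize_3gram("Dear Friend"))
['Dea', 'ear', 'ar ', 'r F', ' Fr', 'Fri', 'rie', 'ien', 'end']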

The first thing I note is that the false negative rate has gone way down:
where, e.g., the whitespace splitter sucked up entire lines of
quoted-printable as single "words", character 3-grams are much better at
finding something damning in that.
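
For instance (an illustrative quoted-printable line I made up, not one
pulled from the corpus):

qp_line = "3D=3Dcenter=3D>FREE=20OFFER!=3C/font=3E"
print len(qp_line.split())                # 1 -- one giant "word"
print len(list(tokenize_3gram(qp_line)))  # 37 overlapping 3-grams to score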

OTOH, the false positive rate has gone up significantly, and it's not
obvious why.  But so many of the false positives in the first run were
actually spam that there's not much point thinking more about it until the
ham corpus is cleaned up.

Before this started, my intuition was that word 2-grams would work best for
text parts, that character n-grams would work best for gibberish parts (like
HTML and JavaScript and MIME decorations), and that headers and embedded
URLs should get special care.  All of that is quite testable, but I need a
cleaner ham corpus first ...
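
For the record, what I mean by word 2-grams is just adjacent pairs of
split() words -- a sketch, nothing in GBayes.py does this yet:

def tokenize_word_2gram(text):
    # Pair up adjacent whitespace-separated words, so "FREE OFFER NOW"
    # yields "FREE OFFER" and "OFFER NOW".
    words = text.split()
    for i in xrange(len(words) - 1):
        yield words[i] + " " + words[i+1]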