[Spambayes] bayesian news filter

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Mon Apr 14 09:02:24 EDT 2003

4/14/2003 1:22:13 AM, "Meyer, Tony" <T.A.Meyer at massey.ac.nz> wrote:

>I copied this to the list since non-spam uses of spambayes have come up
>before and might interest others.  You might want to look through the
>archives for a reasonably recent (March, I think) message about using
>spambayes to classify database records IIRC.

The thread you mention was started by Skip Montanaro on March 27, 2003, titled 
"Non-email use of the spambayes project."  The body of the first posting in 
that thread is as follows:

I've successfully applied the Spambayes code (http://spambayes.sf.net/) to a
non-email application today and thought I'd pass the concept along to
others.  Many of you on c.l.py probably are aware of the Spambayes project
which relies on user segregation of a set of email messages into spam and
ham, then combines the resulting clues they contain to predict the hamminess
or spamminess of email messages it hasn't seen before.  It works extremely
well for this, but the basic concept is applicable to other classification
problems as well.
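
The scoring idea can be sketched with a toy version of the token-probability
math.  (This is a plain naive-Bayes combination with made-up tokens and
counts, not the more robust chi-squared combining SpamBayes actually uses;
it only illustrates the underlying concept.)

```python
# Toy per-token spam probabilities from training counts, combined
# naively.  Tokens and counts here are invented for illustration.
spam_count = {"viagra": 40, "meeting": 1}   # times token seen in spam
ham_count  = {"viagra": 1,  "meeting": 30}  # times token seen in ham
n_spam, n_ham = 50, 50                      # trained message counts

def token_prob(tok):
    s = spam_count.get(tok, 0) / n_spam
    h = ham_count.get(tok, 0) / n_ham
    if s + h == 0:
        return 0.5          # unknown token carries no evidence
    return s / (s + h)

def score(tokens):
    # combine token probabilities with Bayes' rule, assuming
    # independence between tokens (the "naive" part)
    p_spam = p_ham = 1.0
    for tok in tokens:
        p = token_prob(tok)
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)
```

A message full of spammy tokens scores near 1.0, a hammy one near 0.0, and
tokens the classifier has never seen pull the score toward a neutral 0.5.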

I've operated the Mojam and Musi-Cal websites for several years.  Over that
time we've accumulated a sizable venue database.  Unfortunately, many
entries in the database have become stale and don't contribute anything to
the system other than to slow down queries.  Venue names get misspelled,
venues go out of business, non-venue stuff slips into the database, or other
errors occur.  As a result, I had a venue database containing roughly 35,000
entries, only about half of which were referenced by concert items in the
database.  The database as it sat couldn't be licensed to potential
customers because of all the errors it contained.  I could simply delete all
of those entries, but that would delete a lot of useful content from the
database.  Many of those currently unreferenced venue entries *are* correct
and will eventually be associated with other concerts, or will be useful as
corollary information for people using our websites or as an extra database
we can license to content consumers.

I wrote a trivial little application today which allowed me to rummage
through the unreferenced records in the database.  I could delete entries
which I felt were incorrect, but it was a one-at-a-time process.  With
15,000+ entries to scan, one-by-one wasn't going to cut it.

Then I got the idea to use the Spambayes classifier to watch what I was
doing and train on my actions.  I was viewing the records in chunks of 20
items at a time, sorted alphabetically.  I could choose to delete one or
more items or move onto the next chunk of 20 entries.  A deletion caused the
classifier to be trained on the entry as "spam".  Moving onto the next chunk
caused the classifier to be trained on the remaining undeleted entries as
"ham".  Over a short period of time, it got reasonably good at identifying
"spam".  I then started sorting each chunk of 20 items by its spambayes
score and could specify a threshold score below which to eliminate all
entries in that chunk.

The next improvement was to sort the entire mess of records by the spambayes
classification.  I was then seeing entire chunks of records whose scores
fell below the threshold and was able to delete them 20 at a time.

The entire Spambayes-related code is a single tokenizer generator function
and a small Classifier class:

    import math
    import string

    import spambayes.storage

    # translation table that maps nothing - used only so that
    # string.translate can strip punctuation via its deletechars argument
    null_xlate = string.maketrans("", "")

    class Classifier: 
        def __init__(self): 
            self.cls = spambayes.storage.DBDictClassifier("fven.db") 

        def classify(self, d): 
            return self.cls.spamprob(tokenize(d), True) 

        def train(self, d, saved): 
            self.cls.learn(tokenize(d), saved) 

        def __del__(self):
            # flush learned token counts back to the database file
            self.cls.store()
    def tokenize(d): 
        # d is a dictionary as returned by a MySQL query - tokenize the 
        # various fields, noting interesting facts 
        yield "venue length:%d" % len(d["venue"]) 
        for word in d["venue"].split(): 
            # looks like a festival - not a venue at all
            if word.lower().endswith("fest"): 
                yield "venue:<fest>" 
            yield "venue:"+word
        # most correct venue names don't contain punctuation
        if (string.translate(d["venue"], null_xlate, string.punctuation) 
            != d["venue"]): 
            yield "venue:<punctuation>"
        # no address information for this venue - less valuable
        if not d["addr1"]: 
            yield "addr1:<empty>"
        elif d["addr1"][0] not in string.digits:
            # most valid addresses in the US/Canada begin with a street number
            yield "addr1:<no number>" 
        for word in d["addr1"].split(): 
            yield "addr1:"+word 
        for word in d["addr2"].split(): 
            yield "addr2:"+word 
        yield "phone:"+d["phone"] 
        yield "city:"+d["city"].strip() 
        yield "region:"+(d["state"].strip() or d["country"].strip()) 
        yield "zip:"+d["zip"] 
        # sometimes the city gets replicated in the address, making the
        # data "dirtier" and thus less valuable
        vwords = d["venue"].lower().split() 
        for word in d["city"].lower().split(): 
            if word in vwords: 
                yield "city:<in venue>" 
        # the record's id reflects its age - older records, and thus
        # smaller ids, are more likely to be outdated
        try:
            yield "id:2**%.0f" % math.log(int(d["id"]) // 100)
        except OverflowError:
            yield "id:2**0"


    classifier = Classifier()

The input to the tokenizer, instead of being an email message, is a
dictionary representing the return value from an SQL query.  When an item is
to be deleted, the classifier is trained on it as "spam" like so:

    classifier.train(d, False)

When moving to the next chunk, the remaining records are trained as "ham"
like so:

    for item in chunk:
        classifier.train(item, True)
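
The score-sort-and-threshold pass described above might look roughly like
the following sketch.  The record fetching, the stand-in scorer, and the
0.9 cutoff are all illustrative assumptions, not code from the original
application; `classify` here stands in for `Classifier.classify` above.

```python
# Sketch of the bulk-delete pass: score every record, sort the
# "spammiest" first, and flag whole chunks of 20 whose scores all
# clear a cutoff, so they can be deleted in one action.

def classify(d):
    # stand-in scorer: pretend records without an address are junk
    return 0.95 if not d["addr1"] else 0.05

def spammy_chunks(records, chunk_size=20, cutoff=0.9):
    scored = sorted(records, key=classify, reverse=True)
    for i in range(0, len(scored), chunk_size):
        chunk = scored[i:i + chunk_size]
        if all(classify(d) >= cutoff for d in chunk):
            yield chunk     # every record in the chunk is deletable

# 40 fake records: odd ids lack an address and should be flagged
records = [{"id": n, "addr1": "" if n % 2 else "12 Main St"}
           for n in range(40)]
doomed = [d for chunk in spammy_chunks(records) for d in chunk]
```

Because the records are sorted by score first, deletable entries cluster
into whole chunks, which is exactly what made the 20-at-a-time deletion
practical.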

I haven't gotten too crazy with the tokenizer (compare it with the Spambayes
tokenizer!).  I will probably collect some other clues in the tokenizer,
such as what other tables reference the venue record.  For the time being,
it's working okay.  I just need it to do a reasonably good job segregating
records so I can quickly scan a group and make a deletion decision.  So far,
it's doing a very good job.  Not bad for 15-30 minutes of work...
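
One of those extra clues, how many other tables reference a venue record,
could be folded into the tokenizer along these lines.  The `ref_counts`
mapping and the cap of 10 are assumptions for illustration, not part of
the original code:

```python
def reference_clues(d, ref_counts):
    # ref_counts: hypothetical mapping of venue id -> number of
    # concert rows referencing it, gathered by a separate SQL query.
    # A venue nothing points at is more likely to be dead weight.
    n = ref_counts.get(d["id"], 0)
    if n == 0:
        yield "refs:<none>"
    else:
        # cap the count so rare, huge values don't become unique tokens
        yield "refs:%d" % min(n, 10)
```

The generated tokens would simply be chained with the existing ones before
being handed to `learn` or `spamprob`.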



There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.
