[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Barry A. Warsaw barry@python.org
Wed, 21 Aug 2002 16:23:04 -0400


| As far as it goes, yes.  How would it learn?

I have some ideas about how you could hook this into Mailman to do
community/membership assisted learning.  Understanding that people
will be highly motivated to inform you about spam but not about good
messages, you essentially queue a copy of a random sampling of
messages for a few days.  Members can let the list admin know about
leaked spam (via a URL or -spam address, or whatever) and after the
list admin verifies it, this trains the system on that spam.  If no
feedback on a message arrives after a few days, you train the system
on it as a known good message.

You need list admin verification to avoid attack vectors (I get mad at
Guido so I -- a normal user -- label all his messages as spam).
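
Here is a rough sketch of how that queue-and-verify loop might look.
None of this is existing Mailman or GBayes code: the FeedbackQueue
class, its method names, the train(text, is_spam) callback, and the
three-day hold period are all made up for illustration.  Random
sampling is left to the caller, which decides which messages to
enqueue.  Training a rejected report as good mail is also my
assumption, since the admin has looked at it by then.

```python
import time

HOLD_SECONDS = 3 * 24 * 3600  # "a few days"; the exact period is an assumption


class FeedbackQueue:
    """Hold sampled messages until confirmed as spam or aged out as good."""

    def __init__(self, train, hold_seconds=HOLD_SECONDS):
        self.train = train              # hypothetical hook: train(text, is_spam)
        self.hold_seconds = hold_seconds
        self.pending = {}               # msgid -> (text, time queued)
        self.reported = set()           # msgids flagged by members, unverified

    def enqueue(self, msgid, text, now=None):
        """Queue a copy of a sampled message."""
        self.pending[msgid] = (text, time.time() if now is None else now)

    def report_spam(self, msgid):
        """A member flags a message; nothing is trained yet."""
        if msgid in self.pending:
            self.reported.add(msgid)

    def admin_verify(self, msgid, is_spam):
        """Only the list admin's decision on a report trains the filter."""
        if msgid not in self.reported:
            return
        self.reported.discard(msgid)
        text, _ = self.pending.pop(msgid)
        # Assumption: a report the admin rejects is trained as good mail.
        self.train(text, is_spam)

    def expire(self, now=None):
        """Train as good any unreported message held past the hold period."""
        now = time.time() if now is None else now
        for msgid, (text, queued) in list(self.pending.items()):
            if msgid not in self.reported and now - queued >= self.hold_seconds:
                del self.pending[msgid]
                self.train(text, False)
```

A member report on its own never reaches train(); it only marks the
message for the admin, which is what closes off the
label-all-of-Guido's-mail attack.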

| On a more mundane note, I'd like to see decoding of base64 in it.
|
| (Oh, and on a blue-sky note, has anyone taken up Graham's suggestion
| of having one of these things that looks at word pairs instead of
| words?)
|
| It's neat that ESR saw immediately that the daemon should be
| self-contained, no access to home directories.  SpamAssassin doesn't
| have a simple way of doing that, and [ISP] is modifying it to have
| one -- and you wouldn't believe the resistance to the proposed
| changes from some of the SA developers.  Some of them really seem
| to think that it's better and simpler to store user configuration
| in a database than to have the client send its config file to the
| server along with each message.

>>>>> "ZW" == Zack Weinberg <zack@codesourcery.com> writes:

    ZW> I remember you said you didn't want to do base64 decode
    ZW> because it was too slow?

But there might be some interesting, integrated ways around that.  Say
for example, you take a Python-enabled mail server, parse the message
into its decoded form early (but not before low level SMTP-based
rejections) and then pass that parsed and decoded message object tree
around to all the other subsystems that are interested, e.g. the Bayes
filter, and Mailman.  You can at least pay the cost of parsing and
decoding once and amortize it over the rest of the lifetime of that
message on your system.
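
As a sketch of the parse-once idea, something like the following
could sit between the SMTP layer and the interested subsystems.  It
uses the standard library email package as it exists in current
Python; the function names parse_once, decoded_text_parts, and
deliver, and the handler(msg, texts) calling convention, are
hypothetical and not part of any mail server or of Mailman.

```python
import email
from email import policy


def parse_once(raw_bytes):
    """Parse the raw message into a message object tree, exactly once."""
    return email.message_from_bytes(raw_bytes, policy=policy.default)


def decoded_text_parts(msg):
    """Yield the decoded text of every text/* part in the tree."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 or quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name; latin-1 can decode any byte string.
            yield payload.decode("latin-1")


def deliver(raw_bytes, subsystems):
    """Hand the same parsed tree and decoded text to each subsystem."""
    msg = parse_once(raw_bytes)
    texts = list(decoded_text_parts(msg))
    return [handler(msg, texts) for handler in subsystems]
```

Each handler (say, a Bayes filter and a list manager) receives the
same message object and the same decoded text, so the base64 work is
done once no matter how many subsystems look at the message.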

I think we have all the pieces in place to play with this approach on
python.org.

-Barry