[spambayes-dev] RE: [Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Fri Dec 19 17:58:10 EST 2003


> [Tim Peters]
> There are messages I never want to expire.  That creates major new UI
> headaches to be doable.  I believe (but don't yet know) that expiring
> hapaxes can be done without need for user intervention, and without harm.

I hope the "without harm" part is true.  See my question two sections down.


> [Tim Peters]
> At some point, if you want to try your ideas, *try* your ideas <wink> --
> that's what Open Source is all about.  Everyone is born knowing how to
> program in Python, although most don't realize it until they try.

I admit I wasn't aware that I could program in Python since birth, but I'm
willing to take your word on that.  We all have hidden potential.  So that I
don't have to re-invent that round thing with the axle in the middle, could
someone please give me some hints as to which of the mapping features we've
discussed in this thread exist or will soon exist and where I can look for
them?  I saw on spambayes-dev that there is discussion of a new database, so
I don't want to go off on a useless fork with the present db if that comes
to pass.  Search for your inner newbie when you answer this.


> > [Seth Goodman]
> > I agree completely.  This was an important motivation for expiring a
> > whole message at a time.  Training mistakes would eventually drop out
> > of the database without user intervention.  Not that a tool to help
> > track down training mistakes wouldn't be great, but a "casual" user
> > could still make occasional mistakes and the system would recover by
> > itself.
>
> [Tim Peters]
> Without intervention, it will also expire the screaming bright-red HTML
> birthday message sent by my favorite 7-year-old niece, and when she's 8
> the next one may get tagged as spam.  These are the kinds of messages I
> never want to expire.  ...

Here lies my concern.  I sincerely hope that correct classification of these
infrequent, unusual messages is not hapax-driven.  If it is, the result of
pruning infrequently-used hapaxes will be as bad as deleting the whole
message.  If that is the case, the _only_ solution will be to keep either
those hapaxes or the whole message trained forever.  Either way, I agree
this is a big UI problem without an obvious intuitive solution.

Looking at the scoring of some of my "typical" messages, hapaxes don't
appear to contribute much, as you've said before.  Could you look at the
scoring of a couple of those special messages and tell me whether it would
be seriously affected if the hapaxes were gone?
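
To make the question concrete, here is a rough sketch of the comparison I
have in mind: score one message twice, once with the full token counts and
once treating every hapax (a token seen exactly once in training) as if it
had already been expired.  Everything in it is an assumption for
illustration only: the token counts are invented, the names (token_prob,
chi2q, score, drop_hapaxes) are hypothetical, and the math is a simplified
chi-squared combining scheme written from memory, not the actual spambayes
classifier code.

```python
import math

UNKNOWN_PROB = 0.5       # assumed prior for a token with no training data
UNKNOWN_STRENGTH = 0.45  # assumed weight given to that prior

def token_prob(spam_n, ham_n, nspam, nham):
    """Adjusted spam probability for one token (Robinson-style smoothing)."""
    n = spam_n + ham_n
    if n == 0:
        return UNKNOWN_PROB
    spam_ratio = spam_n / nspam
    ham_ratio = ham_n / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    return (UNKNOWN_STRENGTH * UNKNOWN_PROB + n * p) / (UNKNOWN_STRENGTH + n)

def chi2q(x2, v):
    """Chi-squared survival function for an even number v of degrees of freedom."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def score(tokens, counts, nspam, nham, drop_hapaxes=False):
    """Return a spam score in [0, 1]; optionally pretend hapaxes were expired."""
    probs = []
    for tok in set(tokens):
        spam_n, ham_n = counts.get(tok, (0, 0))
        if drop_hapaxes and spam_n + ham_n == 1:
            spam_n, ham_n = 0, 0          # hapax expired: token becomes unknown
        p = token_prob(spam_n, ham_n, nspam, nham)
        if abs(p - 0.5) >= 0.1:           # skip tokens that carry little evidence
            probs.append(p)
    if not probs:
        return 0.5
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

# Invented training data: token -> (spam count, ham count), 10 spam, 10 ham.
counts = {
    "happy": (0, 1), "birthday": (0, 1), "niece": (0, 1),  # ham hapaxes
    "font": (6, 1), "color": (5, 1), "html": (7, 2),       # spammy HTML clues
}
message = ["happy", "birthday", "niece", "font", "color", "html"]

with_hapaxes = score(message, counts, 10, 10)
without_hapaxes = score(message, counts, 10, 10, drop_hapaxes=True)
print("with hapaxes:    %.3f" % with_hapaxes)
print("without hapaxes: %.3f" % without_hapaxes)
```

In this made-up case the only ham evidence is three hapaxes, so dropping
them pushes the score toward spam.  Whether real "special" messages behave
that way is exactly what I'm asking; the toy numbers prove nothing about
an actual database.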

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



