[spambayes-dev] RE: [Spambayes] How low can you go?

Mon Dec 22 13:04:03 EST 2003

> >>> [Seth Goodman]
> >>> I know you're not arguing that, but if there were bidirectional
> >>> msg_id <-> feature_ID maps, it would be fairly easy to expire whole
> >>> messages.
> >>>
> >>> That would obviate the need to track last time seen for every token.
>
> >> [Tim Peters]
> >> Only if you don't want also to be able to expire tokens on their own.
>
> > [T. Alexander Popiel]
> > No... just find the most recent message that the token appeared in,
> > which would be a quick search through a few message times.  A really
> > quick search if you're only looking to expire hapaxes.
>
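
For concreteness, here is a minimal sketch of the bidirectional msg_id <-> token maps discussed above.  It is not existing SpamBayes code: the class name, method names and integer timestamps are all hypothetical, and it assumes each trained message keeps a stable ID.  It shows both whole-message expiry and the "most recent message the token appeared in" lookup.

```python
from collections import defaultdict


class MessageTokenIndex:
    """Hypothetical bidirectional msg_id <-> token index (illustration only)."""

    def __init__(self):
        self.tokens_by_msg = {}                # msg_id -> set of tokens
        self.msgs_by_token = defaultdict(set)  # token -> set of msg_ids
        self.trained_at = {}                   # msg_id -> training timestamp

    def train(self, msg_id, tokens, when):
        # Retraining an already-known message replaces its old entry.
        if msg_id in self.tokens_by_msg:
            self.expire_message(msg_id)
        tokens = set(tokens)
        self.tokens_by_msg[msg_id] = tokens
        self.trained_at[msg_id] = when
        for tok in tokens:
            self.msgs_by_token[tok].add(msg_id)

    def last_trained(self, token):
        """Most recent training time of any message containing the token."""
        msgs = self.msgs_by_token.get(token)
        if not msgs:
            return None
        return max(self.trained_at[m] for m in msgs)

    def expire_message(self, msg_id):
        """Drop a whole message; return the tokens no message refers to now."""
        orphaned = []
        for tok in self.tokens_by_msg.pop(msg_id, ()):
            msgs = self.msgs_by_token[tok]
            msgs.discard(msg_id)
            if not msgs:
                del self.msgs_by_token[tok]
                orphaned.append(tok)
        self.trained_at.pop(msg_id, None)
        return sorted(orphaned)
```

Note that last_trained() only knows when a token was last *trained*, not when it was last consulted for scoring, which is the distinction raised in the reply below.
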
> [Tim Peters]
> I don't want to expire a hapax if it's been used recently in *scoring*.
> Message times can't distinguish used from unused features.  If you're doing
> train-on-everything (with or without whole-msg expiration), a hapax used in
> scoring becomes a non-hapax the first time it's used in scoring.  For

But for really unusual messages of the type you were concerned about, this
may happen only once a year or so, which is too long an interval for a
hapax-expiration scheme.

> mistake/unsure training, a hapax used in scoring remains a hapax if the
> message being scored ends up correctly classified.  Hapaxes that are never
> seen again also remain hapaxes.  Distinguishing used from unused requires
> recording use.
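
To make "recording use" concrete, here is a rough sketch under my own assumptions: every name in it is hypothetical, timestamps are plain numbers, and the caller decides which tokens count as having been used in scoring.  A hapax here means a token seen in exactly one trained message.

```python
class TokenRecord:
    """Per-token bookkeeping (hypothetical, not the SpamBayes record type)."""

    def __init__(self, when):
        self.count = 1         # number of trained messages containing the token
        self.last_used = when  # last time the token was trained or used in scoring


class UsageTrackingStore:
    def __init__(self):
        self.records = {}      # token -> TokenRecord

    def train(self, tokens, when):
        for tok in set(tokens):
            rec = self.records.get(tok)
            if rec is None:
                self.records[tok] = TokenRecord(when)
            else:
                rec.count += 1
                rec.last_used = when

    def note_scoring_use(self, tokens, when):
        """Record that known tokens were consulted while scoring a message."""
        for tok in set(tokens):
            rec = self.records.get(tok)
            if rec is not None:
                rec.last_used = when

    def expire_stale_hapaxes(self, now, max_age):
        """Delete hapaxes that were neither trained nor used within max_age."""
        stale = [tok for tok, rec in self.records.items()
                 if rec.count == 1 and now - rec.last_used > max_age]
        for tok in stale:
            del self.records[tok]
        return sorted(stale)
```

With mistake/unsure training, note_scoring_use() is the only thing that keeps a correctly scored hapax alive; the cost is one extra timestamp per token, which is what the whole-message scheme was trying to avoid.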

--------------------------------------

I'm reposting an earlier post that didn't receive any comments (poor
netiquette, I know) because I feel it's relevant to both comments made
subsequently in this thread and the question of expiring hapaxes not
recently used vs. whole messages.  I also asked for a little help getting
started so that I can test some of my own and/or other people's ideas, and
I would still like that help, unless you folks would prefer otherwise.

I've noticed that hapaxes do seem to contribute to scoring when the training
set is small and I think I've seen others make similar comments.  This also
may be the case for really odd messages.  So please forgive me for the
repost, but here it is:

> [Tim Peters]
> There are messages I never want to expire.  That creates major new UI
> headaches to be doable.  I believe (but don't yet know) that expiring
> hapaxes can be done without need for user intervention, and without harm.

I hope the "without harm" part is true.  See my question two sections down.

> [Tim Peters]
> At some point, if you want to try your ideas, *try* your ideas <wink> --
> that's what Open Source is all about.  Everyone is born knowing how to
> program in Python, although most don't realize it until they try.

I admit I wasn't aware that I could program in Python since birth, but I'm
willing to take your word on that.  We all have hidden potential.  So that I
don't have to re-invent that round thing with the axle in the middle, could
someone please give me some hints as to which of the mapping features we've
discussed in this thread exist or will soon exist and where I can look for
them?  I saw on spambayes-dev that there is discussion of a new database, so
I don't want to go off on a useless fork with the present db if that comes
to pass.  Search for your inner newbie when you answer this.

> > [Seth Goodman]
> > I agree completely.  This was an important motivation for expiring a
> > whole message at a time.  Training mistakes would eventually drop out
> > of the database without user intervention.  Not that a tool to help
> > track down training mistakes wouldn't be great, but a "casual" user
> > could still make occasional mistakes and the system would recover by
> > itself.
>
> [Tim Peters]
> Without intervention, it will also expire the screaming bright-red HTML
> birthday message sent by my favorite 7-year-old niece, and when she's 8 the
> next one may get tagged as spam.  These are the kinds of messages I never
> want to expire.  ...

Here lies my concern.  I sincerely hope that correct classification of these
infrequent, unusual messages is not hapax-driven.  If it is, the result of
pruning infrequently-used hapaxes will be as bad as deleting the whole
message.  If that is the case, the _only_ solution will be to keep either
those hapaxes or the whole message trained forever.  Either way, I agree
this is a big UI problem without an obvious intuitive solution.

It does appear from looking at the scoring of some of my "typical" messages
that hapaxes don't contribute much, as you've said before.  Could you look
at the scoring of a couple of those special messages and tell if their
scoring would be seriously affected if the hapaxes were gone?
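
The check I have in mind could be sketched roughly as follows.  This is only an illustration: hapax_sensitivity, train_counts and score_fn are names I made up, score_fn stands in for whatever classifier is actually in use, and toy_score is a deliberately crude placeholder (a plain average), not the real combining scheme.

```python
def hapax_sensitivity(tokens, train_counts, score_fn):
    """Score a message with and without its hapaxes.

    tokens       -- tokens extracted from the message being examined
    train_counts -- token -> number of trained messages containing it
    score_fn     -- callable taking a set of tokens, returning a spam score
    Returns (score_with_all, score_without_hapaxes, sorted_hapaxes).
    """
    tokens = set(tokens)
    hapaxes = {t for t in tokens if train_counts.get(t, 0) == 1}
    return score_fn(tokens), score_fn(tokens - hapaxes), sorted(hapaxes)


def toy_score(tokens, probs=None):
    """Placeholder scorer: mean per-token spam probability, 0.5 if unknown."""
    probs = probs or {}
    if not tokens:
        return 0.5
    return sum(probs.get(t, 0.5) for t in tokens) / len(tokens)
```

A large gap between the two scores for one of the special messages would suggest that its classification really is hapax-driven.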

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above