[spambayes-dev] RE: [Spambayes] How low can you go?

Mon Dec 22 00:24:25 EST 2003

[T. Alexander Popiel]
> Actually, there have been experiments done (by me) with expiry of
> whole messages.

Yes.  By "the project" having experience I mean controlled tests run by
several people, each on their own email mix, using exactly the same strategy,
with reporting and analysis and all that good stuff.  We've done little of
that (as a group) over the last year.

> I invite you to look at the 'expire4months' regime for my incremental
> testing harness.  Performance was worse than remembering everything,
> but significantly better than mistake-based training (with the
> 'fpfnunsure' regime).
>
> I have not done any experiments with just nuking hapaxes; I didn't see
> any reason to do a partial job instead of a full one.

There may not be one.  The question arose specifically in the context of the
mixed unigram/bigram classifier, which grows the database at a much faster
rate.  I've got ~90% hapaxes after a couple days with that, and the database
is already 3x larger than after months of mistake/unsure training under the
pure-unigram classifier.  Expiring a full message doesn't seem to make
sense after two days, or even after a week; expiring unused hapaxes may;
that's for experiment to decide.
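
To make the growth concrete, here is a minimal sketch of how a mixed
unigram/bigram feature set inflates the hapax count.  It is not the
spambayes tokenizer or classifier: the names unigrams_and_bigrams, train
and hapax_fraction are made up for illustration, and a hapax is taken to
mean a feature that has appeared in exactly one trained message.

```python
from collections import Counter


def unigrams_and_bigrams(words):
    """Yield each word, then each adjacent word pair joined by a space.

    Hypothetical stand-in for a mixed unigram/bigram tokenizer.
    """
    yield from words
    for first, second in zip(words, words[1:]):
        yield first + " " + second


def train(counts, words):
    """Count each distinct feature once per trained message."""
    counts.update(set(unigrams_and_bigrams(words)))


def hapax_fraction(counts):
    """Fraction of distinct features seen in exactly one trained message."""
    if not counts:
        return 0.0
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)


counts = Counter()
train(counts, ["cheap", "pills", "now"])
train(counts, ["cheap", "flights", "now"])
print(len(counts), hapax_fraction(counts))
```

In this toy run only "cheap" and "now" repeat across the two messages;
every bigram is new, so most of the database is hapaxes.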

>>> I know you're not arguing that, but if there were bidirectional
>>> msg_id <-> feature_ID maps, it would be fairly easy to expire whole
>>> messages.
>>>
>>> That would obviate the need to track last time seen for every token.

>> Only if you don't want also to be able to expire tokens on their own.

> No... just find the most recent message that the token appeared in,
> which would be a quick search through a few message times.  A really
> quick search if you're only looking to expire hapaxes.

I don't want to expire a hapax if it's been used recently in *scoring*.
Message times can't distinguish used from unused features.  If you're doing
train-on-everything (with or without whole-msg expiration), a hapax used in
scoring becomes a non-hapax the first time it's used in scoring.  For
mistake/unsure training, a hapax used in scoring remains a hapax if the
message being scored ends up correctly classified.  Hapaxes that are never
seen again also remain hapaxes.  Distinguishing used from unused requires
recording use.
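
A rough sketch of what "recording use" could look like, assuming a
per-token last-used timestamp that is updated both when training and when
scoring.  This is not the spambayes database layout; TokenInfo, TokenStore
and all the method names are hypothetical, and times are plain numbers to
keep the example small.

```python
class TokenInfo:
    """Per-token record: trained-message count plus last time it was used."""

    def __init__(self, now):
        self.count = 0
        self.last_used = now


class TokenStore:
    """Hypothetical token database that records use at scoring time."""

    def __init__(self):
        self.tokens = {}

    def train(self, tokens, now):
        # Each distinct token counts once per trained message.
        for tok in set(tokens):
            info = self.tokens.get(tok)
            if info is None:
                info = self.tokens[tok] = TokenInfo(now)
            info.count += 1
            info.last_used = now

    def note_scoring(self, tokens, now):
        # Scoring does not change counts; it only records that a known
        # token was used, which message times alone cannot tell us.
        for tok in set(tokens):
            info = self.tokens.get(tok)
            if info is not None:
                info.last_used = now

    def expire_unused_hapaxes(self, now, max_age):
        # Drop tokens trained on exactly once and not used within max_age.
        stale = [tok for tok, info in self.tokens.items()
                 if info.count == 1 and now - info.last_used > max_age]
        for tok in stale:
            del self.tokens[tok]
        return stale


store = TokenStore()
store.train(["a", "b", "c"], now=0)
store.train(["a"], now=1)           # "a" is no longer a hapax
store.note_scoring(["b"], now=9)    # "b" stays a hapax, but was used
expired = store.expire_unused_hapaxes(now=10, max_age=5)
print(expired, sorted(store.tokens))
```

Here "b" survives only because its use in scoring was recorded; "c" is the
hapax that was never seen again, so it is the one that gets expired.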

Followups set to spambayes-dev at python.org, as this speculative stuff really
doesn't belong on the general spambayes list.