[spambayes-dev] RE: [Spambayes] How low can you go?
Seth Goodman
nobody at spamcop.net
Wed Dec 17 13:21:08 EST 2003
[Tim Stone]
> Tim> iirc, there was quite a bit of discussion about aging mechanisms
> Tim> quite a few months ago. It seemed like most everyone agreed that
> Tim> it was a good idea, but nobody wanted to implement it for database
> Tim> size considerations. It still seems like a good idea...
>
> [Skip Montanaro]
> Size definitely does matter. <wink> With both bigrams and my set/used
> timestamps (datetime objects), the size of the database ballooned. I think
> the set timestamp could be dispensed with and the last used timestamp
> converted to something smaller, like a YYYYMMDD string.
I know this is a developer conversation, so I hope you don't mind if I offer
my two cents. And I definitely agree that size matters, at least for
databases. I have seen a lot of references, not just in this thread, to
ageing out individual tokens. For a probability calculation in which one of
the variables is the number of messages of a given class that a token
appears in, it seems dangerous to remove only some tokens from a message and
not adjust the message count. Here's my problem with it: all tokens from a
trained message *could* conceivably age out individually, but the trained
message count for the appropriate category would not change. This would
result in wrong probabilities for *all* other tokens, since the database
is in the same state as before the message was trained but the trained message
count is now wrong. It is even harder to conceive what the trained message
count should be if you only remove some of the tokens from a message. Using
a token ageing scheme, the trained message counts would monotonically rise
until you started over, despite removing plenty of tokens over time. I do
understand that most of the aged out tokens would be oddball hapaxes, but
not all of them will be.
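To make the concern concrete, here is a minimal sketch in Python. It assumes
a simplified per-token estimate (the token's frequency among spam divided by
the sum of its spam and ham frequencies); the function name and the numbers
are hypothetical, and this is not the actual SpamBayes calculation.

```python
def token_spam_ratio(spam_count, ham_count, nspam, nham):
    """Simplified per-token spam estimate (an assumption for illustration).

    spam_count / ham_count: trained messages of each class containing the token.
    nspam / nham: total trained messages of each class.
    """
    spam_freq = spam_count / nspam
    ham_freq = ham_count / nham
    return spam_freq / (spam_freq + ham_freq)


# A token seen in 10 of 100 spam messages and 10 of 100 ham messages.
before = token_spam_ratio(10, 10, nspam=100, nham=100)

# Train one more spam message, then let every one of its tokens age out.
# The token counts are back where they started, but nspam is still 101.
after = token_spam_ratio(10, 10, nspam=101, nham=100)

print(before, after)  # the second value has drifted below the first
```

The token database is identical in both calls, yet the estimate for an
unrelated token has shifted, only because the message count was not rolled
back.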
Though I often hear "intuition is a poor guide", I would propose ageing out
whole messages rather than tokens. This at least maintains the integrity of
your basic probability calculation. It also has the advantage of enforcing a
balanced (or deliberately unbalanced) training set size. This would
require storing all the tokens from each trained message in a message
database, with the message entry timestamped rather than the individual
tokens. When a message got too old, all its tokens would have their counts
decremented and the trained message count for that message class would also
be decremented.
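A rough sketch of whole-message ageing, assuming an in-memory store. The
class name, method names and the numeric timestamps are all hypothetical;
this is not an existing SpamBayes interface, only an illustration of keeping
token counts and message counts in step.

```python
from collections import Counter


class MessageAgedStore:
    """Hypothetical store that ages out whole messages, not single tokens."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}
        # msg_id -> (label, timestamp, set of distinct tokens)
        self.messages = {}

    def train(self, msg_id, label, tokens, timestamp):
        if msg_id in self.messages:
            raise ValueError("message already trained: %r" % (msg_id,))
        unique = frozenset(tokens)  # count each token once per message
        self.messages[msg_id] = (label, timestamp, unique)
        self.token_counts[label].update(unique)
        self.message_counts[label] += 1

    def untrain(self, msg_id):
        label, _, unique = self.messages.pop(msg_id)
        counts = self.token_counts[label]
        for tok in unique:
            counts[tok] -= 1
            if counts[tok] <= 0:
                del counts[tok]  # drop tokens no trained message contains
        self.message_counts[label] -= 1  # keep the message count consistent

    def expire(self, now, max_age):
        """Untrain every message older than max_age; return their ids."""
        old = [m for m, (_, ts, _) in self.messages.items()
               if now - ts > max_age]
        for m in old:
            self.untrain(m)
        return old
```

Because untrain() reverses exactly what train() did, the database after
expiry is the same as if the old message had never been trained.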
I would propose going one step further to give the train on everything
approach some additional "memory" for atypical messages (of either type)
that don't occur regularly enough to always be in a fixed-size database.
This might give it some of the advantages of the train on exceptions
schemes, perhaps with less of the "brittle" behavior others have noted and I
have seen as well. One possible mechanism to do this is as follows:
1) If the database message count is at maximum, untrain the oldest message.
2) Score the new message to be trained.
3) Move the new training message timestamp into the future by an amount
related to its "distance" from a perfect score for that message type.
More atypical messages that classify poorly would be timestamped further
into the future and would thus stick around longer than ones that classify
perfectly. The ones that classify perfectly would have their tokens
replaced sooner, which should be no great loss. With train on everything,
there should be lots of messages that classify very well to take their
place. There could be a scaling constant that sets the maximum amount of
extra time that an unusual message remains in the database. This determines
how long the database "memory" is, along with the maximum message count and
the number of messages that you train per day (depends on your training
scheme).
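The three steps above could be sketched roughly as follows. This assumes
scores in the range 0.0 (certain ham) to 1.0 (certain spam), a linear
relation between distance and extra lifetime, and a plain dict mapping
message ids to effective timestamps. The names effective_timestamp,
train_with_memory, score_fn and max_bonus are hypothetical, and the actual
token untraining and training are left out.

```python
def effective_timestamp(now, score, is_spam, max_bonus):
    """Push the timestamp into the future for poorly classified messages.

    max_bonus is the scaling constant: the most extra time an atypical
    message can gain (assumed linear in the distance from a perfect score).
    """
    perfect = 1.0 if is_spam else 0.0
    distance = abs(perfect - score)  # 0.0 = classified perfectly
    return now + max_bonus * distance


def train_with_memory(db, max_messages, msg_id, is_spam, now,
                      score_fn, max_bonus):
    """db maps msg_id -> effective timestamp. Returns the evicted id or None."""
    evicted = None
    # 1) If the database is full, untrain the "oldest" message, meaning
    #    the one with the smallest effective timestamp.
    if len(db) >= max_messages:
        evicted = min(db, key=db.get)
        del db[evicted]
    # 2) Score the new message against the remaining training data.
    score = score_fn(msg_id)
    # 3) Record it with a timestamp shifted by its distance from perfect.
    db[msg_id] = effective_timestamp(now, score, is_spam, max_bonus)
    return evicted


# Usage with made-up scores: m2 classifies poorly, so it outlives m3.
scores = {"m1": 1.0, "m2": 0.5, "m3": 1.0, "m4": 1.0}
db = {}
evictions = [
    train_with_memory(db, 2, m, True, now, scores.get, max_bonus=10)
    for now, m in enumerate(["m1", "m2", "m3", "m4"])
]
```

In this run the poorly scored m2 is still in the database after m4 is
trained, while the later but perfectly scored m3 has already been untrained.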
The goal of this is to allow train on everything, keep moderate database
sizes and still have a long enough memory for atypical messages that are
infrequent.
--
Seth Goodman
Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com
Spambots: disregard the above