[spambayes-dev] RE: [Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Thu Dec 18 20:26:43 EST 2003


[Seth Goodman]
> Thanks for taking the time to construct such a complete set of
> answers.  I learned a lot from it and I assume other list readers did
> as well.

My pleasure, but I'm afraid it was taken out of sleep time, and I can't do
that again.  So, no offense intended, I have to be very brief here, while
wanting to do more:

> Not really.  If you decrement all the token counts from a trained
> message, the database is in the exact same state as it was before you
> trained on that message (ignoring subsequent messages trained).  At
> that point, the trained message count was N-1, so that is the best
> thing to use for the probability calculation rather than N.  The
> message count will keep increasing as you train new messages but the
> token database will eventually level off.  That suggests that the
> trained message counts will become too large as time goes on.
>
> If you only expire hapaxes, perhaps the incorrect message count is a
> technicality and won't have a significant effect on the spam
> probabilities. But unless you expire non-hapaxes as well, the token
> database can't track a changing message stream very well.  Once you
> start expiring non-hapax tokens (is there a name for these?), my
> guess is that you can no longer ignore the incorrect message count
> issue.  So how _do_ you do expiration "correctly" if not by whole
> messages?

I only intend to expire hapaxes for now, with whole-msg expiration after;
but one thing at a time, and each step will take a long time for testing.
There's no rush.  The idea that all the tokens in a message could get
expired seems too implausible to me to worry about, when only hapaxes are
expired.
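
To make the bookkeeping concrete, here is a rough Python sketch of the two
operations under discussion: untraining a message (decrement its token
counts and the trained-message count, so the database looks as if the
message had never been trained -- the N-1 point above) and pruning hapaxes
(tokens whose total count is exactly 1).  This is a toy illustration, not
SpamBayes's actual classifier or storage API:

    class ToyTokenDB:
        """Toy stand-in for a token database, not the real SpamBayes classes."""

        def __init__(self):
            self.counts = {}   # token -> (spam_count, ham_count)
            self.nspam = 0     # number of spam messages trained
            self.nham = 0      # number of ham messages trained

        def train(self, tokens, is_spam):
            for tok in set(tokens):   # count each token once per message
                spam, ham = self.counts.get(tok, (0, 0))
                if is_spam:
                    spam += 1
                else:
                    ham += 1
                self.counts[tok] = (spam, ham)
            if is_spam:
                self.nspam += 1
            else:
                self.nham += 1

        def untrain(self, tokens, is_spam):
            # Reverse a previous train() call exactly; assumes the message
            # really was trained with these tokens and this label.
            for tok in set(tokens):
                spam, ham = self.counts[tok]
                if is_spam:
                    spam -= 1
                else:
                    ham -= 1
                if spam == 0 and ham == 0:
                    del self.counts[tok]
                else:
                    self.counts[tok] = (spam, ham)
            if is_spam:
                self.nspam -= 1
            else:
                self.nham -= 1

        def prune_hapaxes(self):
            # Drop tokens seen exactly once across all training so far.
            for tok in [t for t, (s, h) in self.counts.items() if s + h == 1]:
                del self.counts[tok]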

...

> Offhand, adding a single timestamp per message at training time sounds
> easier than tracking the last time seen for every token in the
> database.  As far as the "elaborate" scheme I suggested for variable
> expiration times, all that's involved is changing the message
> timestamp before storing it.  Since you don't have anything like that
> now, you can just ignore that idea and the extra parameter that goes
> with it.  BTW, that parameter value is not just a wild-ass guess,
> it's a SWAG (sophisticated wild-ass guess), and I don't like them any
> better than you do :)
>
> Either way, rather than frequently searching for expired tokens (in a
> very long list), you would only do token expiration when you have to
> train a new message.  At that point, you find the oldest trained
> message (from a much shorter list) and untrain it.  The extra
> complication is storing the token list with each message ID plus its
> training timestamp.  That doesn't sound big compared to cross
> referencing every token to every message it appeared in.  They're
> certainly not mutually exclusive and you later made a good argument
> for having this extra information anyway.

There are messages I never want to expire.  Making that possible creates
major new UI headaches.  I believe (but don't yet know) that expiring
hapaxes can be done without user intervention, and without harm.

At some point, if you want to try your ideas, *try* your ideas <wink> --
that's what Open Source is all about.  Everyone is born knowing how to
program in Python, although most don't realize it until they try.
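
For instance, a trial sketch of the whole-message scheme (illustration only,
not anything SpamBayes does today): keep a per-message record of its tokens
and a training timestamp, and untrain the oldest message whenever a cap is
hit, so expiration only happens at training time.  The db argument is
assumed to be something like the toy database sketched above:

    import time

    class ExpiringTrainer:
        """Hypothetical wrapper: expire the oldest trained message on demand."""

        def __init__(self, db, max_messages=10000):
            self.db = db
            self.max_messages = max_messages
            self.history = {}   # msg_id -> (trained_at, tokens, is_spam)

        def train(self, msg_id, tokens, is_spam):
            # Expire before training so the cap is never exceeded.
            while len(self.history) >= self.max_messages:
                oldest = min(self.history, key=lambda m: self.history[m][0])
                _, old_tokens, old_is_spam = self.history.pop(oldest)
                self.db.untrain(old_tokens, old_is_spam)
            self.db.train(tokens, is_spam)
            self.history[msg_id] = (time.time(), tokens, is_spam)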

...

> I agree completely.  This was an important motivation for expiring a
> whole message at a time.  Training mistakes would eventually drop out
> of the database without user intervention.  Not that a tool to help
> track down training mistakes wouldn't be great, but a "casual" user
> could still make occasional mistakes and the system would recover by
> itself.

Without intervention, it will also expire the screaming bright-red HTML
birthday message sent by my favorite 7-year-old niece, and when she's 8 the
next one may get tagged as spam.  These are the kinds of messages I never
want to expire.  "Elaborate" before referred to untested gimmicks for
adjusting expiration date based on "how far away" a message was from its
correct classification, etc.  I don't have a feel for whether that can be
made to work well in real life, and it needs serious implementation effort
and testing to get a good feel.  In the vanishingly small time I can still
make for this project, I need to give it to things my experience suggests
will almost certainly win, with no more effort or surprises than I already
know I'll have to endure.

...

> Sounds like _you're_ arguing for expiration of whole messages :)

Oh yes, I do want that -- eventually.  We have no experience with that in
this project, though; we have a lot of experience with the consequences of
hapaxes, and I have no fears remaining about picking on them.

> I know you're not arguing that, but if there were bidirectional msg_id
> <-> feature_ID maps, it would be fairly easy to expire whole
> messages.

Yes, and that's a real attraction.  Doing the actual expiration would be
trivially easy and fast then.  Deciding *when* to do expiration, and which
messages to expire, are the things we really don't know anything about yet.
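
For illustration, a minimal version of those bidirectional maps (toy code,
not SpamBayes's storage layer); with both maps in hand, expiring a whole
message reduces to walking its token list and cleaning up the reverse index:

    from collections import defaultdict

    class MessageIndex:
        """Hypothetical msg_id <-> token cross-reference."""

        def __init__(self):
            self.msg_to_tokens = {}                 # msg_id -> set of tokens
            self.token_to_msgs = defaultdict(set)   # token  -> set of msg_ids

        def add(self, msg_id, tokens):
            toks = set(tokens)
            self.msg_to_tokens[msg_id] = toks
            for tok in toks:
                self.token_to_msgs[tok].add(msg_id)

        def remove(self, msg_id):
            # Return the message's tokens and drop it from both maps.
            toks = self.msg_to_tokens.pop(msg_id)
            for tok in toks:
                holders = self.token_to_msgs[tok]
                holders.discard(msg_id)
                if not holders:
                    del self.token_to_msgs[tok]
            return toks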

> That would obviate the need to track last time seen for every token.

Only if you don't also want to be able to expire tokens on their own.

> In any case, I hope you move in the direction of saving such maps as
> it adds so much flexibility.

Not to mention database size <wink>.

...

> All your arguments on this point make lots of sense.  I'm a little
> surprised that you had significant collisions mapping perhaps 100K
> items (my guess) into a 32-bit space.

That would be a very small database for the mixed unigram-bigram scheme, and
the unigram-only database I used most often in original testing (for
filtering high-volume tech mailing lists) contained about 350K tokens.  As
Ryan explained later, the Birthday Paradox can't be avoided here, and has
real consequences.
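
For scale, a back-of-the-envelope birthday-paradox estimate (standard
arithmetic, not a number from the original test runs): with 350K distinct
tokens hashed into a 32-bit space, a handful of collisions is expected and
at least one is a near-certainty:

    from math import exp

    n = 350000           # distinct tokens
    buckets = 2 ** 32    # size of a 32-bit hash space
    pairs = n * (n - 1) / 2.0

    expected_collisions = pairs / buckets        # roughly 14
    p_at_least_one = 1 - exp(-pairs / buckets)   # about 0.999999

    print(round(expected_collisions, 1), p_at_least_one)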

> I think that is rather dependent on the hash used, but that's what
> you saw.

I used Python's builtin 32-bit hash() function, and the observed collision
rate was indistinguishable from what a truly random 32-bit hash would have
produced (about one standard deviation lower).  The damnable thing is that
you only need one extremely unfortunate collision to start seeing results
that are incomprehensible to the human eye.
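
A quick empirical check of the same point, again just a sketch: generate
random synthetic tokens, mask hash() down to 32 bits so the experiment uses
a 32-bit hash space regardless of the platform's hash width, and count how
many values repeat.  The count should land near the birthday-paradox
estimate above, give or take a standard deviation or so:

    import random
    import string

    def count_32bit_collisions(n=350000, token_len=12):
        # Generate n distinct random "tokens".
        tokens = set()
        while len(tokens) < n:
            tokens.add(''.join(random.choice(string.ascii_lowercase)
                               for _ in range(token_len)))
        # Count tokens landing on an already-seen 32-bit hash value.
        seen = set()
        collisions = 0
        for tok in tokens:
            h = hash(tok) & 0xFFFFFFFF
            if h in seen:
                collisions += 1
            else:
                seen.add(h)
        return collisions

    print(count_32bit_collisions())   # typically lands in the low teens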

> Since you need the cleartext anyway, your feature-ID concept is far
> superior.

We don't *need* the cleartext, really, it's just highly desirable.  I'll
certainly endure a lot to keep the cleartext.  If this isn't the smallest or
fastest spam filter possible, I don't really care.  I don't even care
whether it's popular.  What I care about most is whether it filters my damn
spam.

> Thanks for educating me.

Don't mistake a lecture for education <wink>.  I'd love to be able to afford
the luxury of *discussing* it with you instead (you've got a lot of
plausible ideas and express them well), but I'm afraid I just can't.  With any
luck, maybe my employer will go out of business <wink>.



