From: Tim Peters [mailto:email@example.com]
For example <wink>, "free!!" never appears in a ham msg in my corpora, but appears often in the spam samples. OTOH, plain "free" is a weak spam indicator on c.l.py, given the frequent supposedly on-topic arguments about free beer versus free speech, etc.
I'd actually thought of this limitation, and how it could be avoided. This so-called "more intelligent" tokeniser would probably work best in a system which scored word pairs as well as single words. For example:
"I want free beer!!!"
would be split as
'I' 'want' 'free' 'beer' '!!!'
This might then be scored as
'I'          0.5
'want'       0.5
'free'       0.5
'beer'       0.1  (beer is unlikely to be a spam indicator ;)
'!!!'        0.9
'I want'     0.3
'want free'  0.99 (do you want free hot ...?)
'free beer'  0.01 (free beer is never a spam indicator ;)
'beer !!!'   0.5
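A minimal sketch of such a tokeniser (the regex is my own rough guess, not anything from GBayes): words become one kind of token, and runs of punctuation like "!!!" become tokens of their own rather than being stripped.

```python
import re

# Hypothetical tokeniser: alphabetic words (with apostrophes) and runs
# of "interesting" punctuation each become separate tokens, so
# "beer!!!" yields both 'beer' and '!!!'.
TOKEN_RE = re.compile(r"[A-Za-z']+|[!?$]+")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("I want free beer!!!"))
# -> ['I', 'want', 'free', 'beer', '!!!']
```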
Whether single words and word pairs should be weighted differently I don't know - my gut feeling is that they should be weighted the same, but guts are no replacement for empirical evidence.
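To make the equal-weighting idea concrete, here is a rough sketch (the probability table and the combining rule are illustrative assumptions of mine, not GBayes code): unigrams and bigrams go into one feature list and contribute to a naive-Bayes-style odds product on equal terms.

```python
from math import prod

def features(tokens):
    # Unigrams plus adjacent word pairs, all treated as equal features.
    pairs = [' '.join(p) for p in zip(tokens, tokens[1:])]
    return tokens + pairs

# Illustrative spam probabilities (made up, matching the example above);
# a real system would estimate these from ham/spam corpora.
spamprob = {
    'free': 0.5, 'beer': 0.1, '!!!': 0.9,
    'I want': 0.3, 'want free': 0.99, 'free beer': 0.01,
}

def score(tokens, unknown=0.5):
    # Bayesian-style combination: unseen features get a neutral 0.5.
    probs = [spamprob.get(f, unknown) for f in features(tokens)]
    s = prod(probs)                  # evidence for spam
    h = prod(1 - p for p in probs)   # evidence for ham
    return s / (s + h)
```

Note how the pair 'free beer' (0.01) pulls the score down even though '!!!' (0.9) pushes it up - exactly the effect single-word scoring can't capture.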
I just brought CVS python down at home and tried compiling with MinGW (no success so far ...) but I'll have a look at the GBayes stuff sometime soon and see if the above helps at all. Unfortunately, I just started my work day ...