[Python-Dev] The first trustworthy <wink> GBayes results
Mon, 2 Sep 2002 10:04:39 +1000
> From: Tim Peters [mailto:firstname.lastname@example.org]
> For example <wink>, "free!!" never appears in a ham msg in my
> corpora, but
> appears often in the spam samples. OTOH, plain "free" is a weak spam
> indicator on c.l.py, given the frequent supposedly on-topic
> arguments about
> free beer versus free speech, etc.
I'd actually thought of this limitation, and how it could be avoided. This
so-called "more intelligent" tokeniser would probably work best in a system
which scored word pairs as well as single words. For example:
"I want free beer!!!"
would be split as
'I' 'want' 'free' 'beer' '!!!'
This might then be scored as
'beer' 0.1 (beer is unlikely to be a spam indicator ;)
'I want' 0.3
'want free' 0.99 (do you want free hot ...?)
'free beer' 0.01 (free beer is never a spam indicator ;)
'beer !!!' 0.5
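The tokenising-and-pairing scheme sketched above could look something like the following. This is only a hypothetical sketch (the token regex and the function names are my own inventions, not part of the GBayes code); it splits a message into word and punctuation-run tokens, then emits each single token plus each adjacent pair as features to be scored:

```python
import re

def tokenize(text):
    """Split text into word tokens and runs of punctuation like '!!!'.

    Hypothetical sketch only -- the real GBayes tokenizer differs.
    """
    return re.findall(r"[A-Za-z']+|[!?$]+", text)

def unigrams_and_bigrams(tokens):
    """Yield each single token, then each adjacent pair joined by a space."""
    for tok in tokens:
        yield tok
    for first, second in zip(tokens, tokens[1:]):
        yield f"{first} {second}"

tokens = tokenize("I want free beer!!!")
# -> ['I', 'want', 'free', 'beer', '!!!']
features = list(unigrams_and_bigrams(tokens))
# single-word features plus pairs such as 'free beer' and 'beer !!!'
```

Each feature (single word or pair) would then be looked up in the spam/ham probability table exactly as single words are now.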
Whether single words and word pairs should be weighted differently I don't
know - my gut feeling is that they should be weighted the same, but guts are
no replacement for empirical evidence.
I just brought CVS python down at home and tried compiling with MinGW (no
success so far ...) but I'll have a look at the GBayes stuff sometime soon
and see if the above helps at all. Unfortunately, I just started my work day