[Python-Dev] The first trustworthy <wink> GBayes results

Delaney, Timothy tdelaney@avaya.com
Mon, 2 Sep 2002 10:04:39 +1000


> From: Tim Peters [mailto:tim.one@comcast.net]
> 
> For example <wink>, "free!!" never appears in a ham msg in my corpora,
> but appears often in the spam samples.  OTOH, plain "free" is a weak
> spam indicator on c.l.py, given the frequent supposedly on-topic
> arguments about free beer versus free speech, etc.

I'd actually thought of this limitation, and how it could be avoided. This
so-called "more intelligent" tokeniser would probably work best in a system
which scored word pairs as well as single words. For example:

    "I want free beer!!!"

would be split as

    'I' 'want' 'free' 'beer' '!!!'
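
A minimal sketch of a split like this, assuming a regular expression that
treats each run of word characters and each run of punctuation as one token.
The names `TOKEN_RE` and `tokenise` are hypothetical and are not taken from
the GBayes code:

```python
import re

# Hypothetical helper, not part of GBayes: a run of word characters or a
# run of punctuation each becomes one token; whitespace is discarded.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def tokenise(text):
    """Split text into word tokens and punctuation-run tokens."""
    return TOKEN_RE.findall(text)

print(tokenise("I want free beer!!!"))
# → ['I', 'want', 'free', 'beer', '!!!']
```

Note that this splits "free!!" into 'free' and '!!', so on its own it loses
the strong single token from the quoted example. That is why it would
probably need to be combined with pair scoring.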

This might then be scored as

    'I'          0.5
    'want'       0.5
    'free'       0.5
    'beer'       0.1 (beer is unlikely to be a spam indicator ;)
    '!!!'        0.9
    'I want'     0.3
    'want free'  0.99 (do you want free hot ...?)
    'free beer'  0.01 (free beer is never a spam indicator ;)
    'beer !!!'   0.5

I don't know whether single words and word pairs should be weighted
differently. My gut feeling is that they should be weighted the same, but
guts are no replacement for empirical evidence.
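
As a rough sketch of how the singles and pairs could be generated from a
token list before scoring: the function name `features` is hypothetical, the
pairs are simply adjacent tokens joined with a space, and no weighting is
applied, since which weighting works is exactly the open question.

```python
def features(tokens):
    """Return the single tokens followed by every adjacent token pair."""
    # zip(tokens, tokens[1:]) walks the list as overlapping neighbours.
    pairs = [first + " " + second for first, second in zip(tokens, tokens[1:])]
    return list(tokens) + pairs

tokens = ["I", "want", "free", "beer", "!!!"]
for feature in features(tokens):
    print(repr(feature))
```

For the five tokens above this yields the five single words followed by
'I want', 'want free', 'free beer' and 'beer !!!', which are the nine
entries in the scoring table.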

I just pulled down CVS Python at home and tried compiling it with MinGW (no
success so far ...), but I'll have a look at the GBayes stuff sometime soon
and see if the above helps at all. Unfortunately, I just started my work day
...

Tim Delaney