From: Tim Peters [mailto:email@example.com]
For example <wink>, "free!!" never appears in a ham msg in my corpora, but appears often in the spam samples. OTOH, plain "free" is a weak spam indicator on c.l.py, given the frequent supposedly on-topic arguments about free beer versus free speech, etc.
I'd actually thought of this limitation, and how it could be avoided. This so-called "more intelligent" tokeniser would probably work best in a system which scored word pairs as well as single words. For example:
"I want free beer!!!"
would be split as
'I' 'want' 'free' 'beer' '!!!'
This might then be scored as
'I'          0.5
'want'       0.5
'free'       0.5
'beer'       0.1  (beer is unlikely to be a spam indicator ;)
'!!!'        0.9
'I want'     0.3
'want free'  0.99 (do you want free hot ...?)
'free beer'  0.01 (free beer is never a spam indicator ;)
'beer !!!'   0.5
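A minimal sketch of such a tokeniser (the regex is my own rough guess, not anything from GBayes): words become one kind of token, and runs of punctuation like "!!!" become tokens of their own rather than being stripped.

```python
import re

# Hypothetical tokeniser: alphabetic words (with apostrophes) and runs
# of "interesting" punctuation each become separate tokens, so
# "beer!!!" yields both 'beer' and '!!!'.
TOKEN_RE = re.compile(r"[A-Za-z']+|[!?$]+")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("I want free beer!!!"))
# -> ['I', 'want', 'free', 'beer', '!!!']
```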
Whether single words and word pairs should be weighted differently I don't know - my gut feeling is that they should be weighted the same, but guts are no replacement for empirical evidence.
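To make the equal-weighting idea concrete, here is a rough sketch (the probability table and the combining rule are illustrative assumptions of mine, not GBayes code): unigrams and bigrams go into one feature list and contribute to a naive-Bayes-style odds product on equal terms.

```python
from math import prod

def features(tokens):
    # Unigrams plus adjacent word pairs, all treated as equal features.
    pairs = [' '.join(p) for p in zip(tokens, tokens[1:])]
    return tokens + pairs

# Illustrative spam probabilities (made up, matching the example above);
# a real system would estimate these from ham/spam corpora.
spamprob = {
    'free': 0.5, 'beer': 0.1, '!!!': 0.9,
    'I want': 0.3, 'want free': 0.99, 'free beer': 0.01,
}

def score(tokens, unknown=0.5):
    # Bayesian-style combination: unseen features get a neutral 0.5.
    probs = [spamprob.get(f, unknown) for f in features(tokens)]
    s = prod(probs)                  # evidence for spam
    h = prod(1 - p for p in probs)   # evidence for ham
    return s / (s + h)
```

Note how the pair 'free beer' (0.01) pulls the score down even though '!!!' (0.9) pushes it up - exactly the effect single-word scoring can't capture.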
I just brought CVS python down at home and tried compiling with MinGW (no success so far ...) but I'll have a look at the GBayes stuff sometime soon and see if the above helps at all. Unfortunately, I just started my work day ...