[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters <tim.one@comcast.net>
Sun, 01 Sep 2002 19:40:38 -0400


[Delaney, Timothy]
> Speaking of which, I had a thought this morning (in the shower of
> course ;) about a slightly more intelligent tokeniser.

"Intelligence" isn't necessarily helpful with a statistical scheme, and
always makes it harder to adapt to other languages.

> Split on whitespace, then runs of punctuation at the end of "words" are
> split off as a separate word.

For example <wink>, "free!!" never appears in a ham msg in my corpora, but
appears often in the spam samples.  OTOH, plain "free" is a weak spam
indicator on c.l.py, given the frequent supposedly on-topic arguments about
free beer versus free speech, etc.

>     a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
>
>     A phrase. -> 'A', 'phrase', '.'
>
>     WTF??? -> 'WTF', '???'
>
>     >>> import module -> '>>>', 'import', 'module'

The first and last are the same as just splitting on whitespace.  The
2nd-last may lose the distinction between WTF??? and a solicitation to join
the World Trade Federation <wink>; WTF isn't likely to make it into a list
of smoking guns regardless.  Hard to guess about the 2nd.  The database
isn't large enough to worry about reducing its size, btw -- the only
gimmicks I care about are those that increase accuracy.
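For concreteness, here's a minimal sketch of what I take the proposal to
mean (the tokenize name and the regex are made up for illustration -- this
isn't what timtest.py does):

    import re

    # Peel a trailing run of punctuation off a whitespace-delimited word.
    # The non-greedy prefix keeps interior punctuation (as in 'a.b.c')
    # attached, and words that are *all* punctuation (like '>>>') are
    # left alone.
    _trailing_punct = re.compile(r'^(.*?)([^\w\s]+)$')

    def tokenize(text):
        words = []
        for word in text.split():
            m = _trailing_punct.match(word)
            if m and m.group(1):
                words.append(m.group(1))
                words.append(m.group(2))
            else:
                words.append(word)
        return words

which reproduces the examples above:

    >>> tokenize('a.b.c')
    ['a.b.c']
    >>> tokenize('A phrase.')
    ['A', 'phrase', '.']
    >>> tokenize('WTF???')
    ['WTF', '???']
    >>> tokenize('>>> import module')
    ['>>>', 'import', 'module']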

> Might this be useful? No code of course ;)

It takes about an hour to run and evaluate tests for one change.  If you
want to motivate me to try, supply a patch against timtest.py (in the
sandbox), else I've already got far more ideas than time to test them
properly.  Anyone else want to test this one?