[Python-Dev] The first trustworthy <wink> GBayes results
Tim Peters
tim.one@comcast.net
Sun, 01 Sep 2002 19:40:38 -0400
[Delaney, Timothy]
> Speaking of which, I had a thought this morning (in the shower of
> course ;) about a slightly more intelligent tokeniser.
"Intelligence" isn't necessarily helpful with a statistical scheme, and
always makes it harder to adapt to other languages.
> Split on whitespace, then runs of punctuation at the end of "words" are
> split off as a separate word.
For example <wink>, "free!!" never appears in a ham msg in my corpora, but
appears often in the spam samples. OTOH, plain "free" is a weak spam
indicator on c.l.py, given the frequent supposedly on-topic arguments about
free beer versus free speech, etc.
> a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
>
> A phrase. -> 'A', 'phrase', '.'
>
> WTF??? -> 'WTF', '???'
>
> >>> import module -> '>>>', 'import', 'module'
The first and last are the same as just splitting on whitespace. The
2nd-last may lose the distinction between WTF??? and a solicitation to join
the World Trade Federation <wink>; WTF isn't likely to make it into a list
of smoking guns regardless. Hard to guess about the 2nd (peeling the
sentence-ending period off 'phrase.'). The database
isn't large enough to worry about reducing its size, btw -- the only
gimmicks I care about are those that increase accuracy.
> Might this be useful? No code of course ;)
It takes about an hour to run and evaluate tests for one change. If you
want to motivate me to try it, supply a patch against timtest.py (in the
sandbox); else I've already got far more ideas than time to test them
properly. Anyone else want to test this one?
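If someone does, a minimal sketch of the proposed rule might look like the
following (split on whitespace, then peel a trailing run of punctuation off
each chunk as its own token). The function name and the exact punctuation
set are assumptions here, not anything from timtest.py:

    import string

    PUNCT = frozenset(string.punctuation)  # assumed punctuation set

    def tokenize(text):
        # Split on whitespace, then split a trailing run of
        # punctuation off each chunk as a separate token.
        for chunk in text.split():
            i = len(chunk)
            while i and chunk[i-1] in PUNCT:
                i -= 1
            if 0 < i < len(chunk):
                yield chunk[:i]   # 'phrase.' -> 'phrase'
                yield chunk[i:]   # ... and '.'
            else:
                # All punctuation ('>>>') or no trailing punctuation
                # ('a.b.c') -- emit the chunk unchanged.
                yield chunk

    >>> list(tokenize('WTF??? >>> import module'))
    ['WTF', '???', '>>>', 'import', 'module']

Note that this turns 'free!!' into 'free' plus '!!', so the strong
'free!!' clue above would then ride on how spammy '!!' turns out to be on
its own.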