RE: [Python-Dev] The first trustworthy <wink> GBayes results

From: Tim Peters [mailto:tim.one@comcast.net]
Training GBayes is cheap, and the more you feed it the less need to do information-destroying transformations (like folding case or ignoring punctuation).
Speaking of which, I had a thought this morning (in the shower of course ;) about a slightly more intelligent tokeniser. Split on whitespace, then runs of punctuation at the end of "words" are split off as a separate word. So:

a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
A phrase. -> 'A', 'phrase', '.'
WTF??? -> 'WTF', '???'
>>> import module -> '>>>', 'import', 'module'

Might this be useful? No code of course ;)

Tim Delaney
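[The post ships no code; purely as a sketch of the splitting rule described above, something along these lines would reproduce the examples. The exact set of punctuation characters to split off is a guess, not anything from the thread.]

import re

def tokenize(text):
    # Split on whitespace, then peel a trailing run of punctuation off
    # each word as its own token (sketch only; punctuation set guessed).
    for word in text.split():
        m = re.match(r"^(.*?)([!?.,:;]+)$", word)
        if m and m.group(1):
            yield m.group(1)   # the word proper
            yield m.group(2)   # the trailing run of punctuation
        else:
            yield word

# list(tokenize("A phrase.  WTF???  >>> import module"))
# -> ['A', 'phrase', '.', 'WTF', '???', '>>>', 'import', 'module']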

[Delaney, Timothy]
Speaking of which, I had a thought this morning (in the shower of course ;) about a slightly more intelligent tokeniser.
"Intelligence" isn't necessarily helpful with a statistical scheme, and always makes it harder to adapt to other languages.
Split on whitespace, then runs of punctuation at the end of "words" are split off as a separate word.
For example <wink>, "free!!" never appears in a ham msg in my corpora, but appears often in the spam samples. OTOH, plain "free" is a weak spam indicator on c.l.py, given the frequent supposedly on-topic arguments about free beer versus free speech, etc.
a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
A phrase. -> 'A', 'phrase', '.'
WTF??? -> 'WTF', '???'
>>> import module -> '>>>', 'import', 'module'
The first and last are the same as just splitting on whitespace. The 2nd-last may lose the distinction between WTF??? and a solicitation to join the World Trade Federation <wink>; WTF isn't likely to make it into a list of smoking guns regardless. Hard to guess about the 2nd. The database isn't large enough to worry about reducing its size, btw -- the only gimmicks I care about are those that increase accuracy.
Might this be useful? No code of course ;)
It takes about an hour to run and evaluate tests for one change. If you want to motivate me to try, supply a patch against timtest.py (in the sandbox), else I've already got far more ideas than time to test them properly. Anyone else want to test this one?

Tim> It takes about an hour to run and evaluate tests for one change.
Tim> If you want to motivate me to try, supply a patch against
Tim> timtest.py (in the sandbox), else I've already got far more ideas
Tim> than time to test them properly. Anyone else want to test this
Tim> one?

Care to identify some of those ideas?

Skip

Tim> It takes about an hour to run and evaluate tests for one change.
Tim> If you want to motivate me to try, supply a patch against
Tim> timtest.py (in the sandbox), else I've already got far more ideas
Tim> than time to test them properly. Anyone else want to test this
Tim> one?
[Skip Montanaro]
Care to identify some of those ideas?
Nope, I'm puking sick of this topic now. Look for XXX comments in timtest.py for some of them. You can infer others from places where XXX comments aren't <wink>.

The f-p rate can't be improved anymore (meaning that it's too low for me to measure an improvement if one were made). The f-n rate is still high, but adding more headers is likely the most effective way to cut f-n, and my testing corpora won't allow me to test that (the header lines are too damned different since my ham and spam came from entirely different sources).

It's somebody else's turn now ... and thank Barry for the email pkg! It's been a joy to use.
participants (3)
- Delaney, Timothy
- Skip Montanaro
- Tim Peters