[Python-Dev] The first trustworthy <wink> GBayes results

Delaney, Timothy tdelaney@avaya.com
Mon, 2 Sep 2002 08:53:39 +1000


> From: Tim Peters [mailto:tim.one@comcast.net]
> 
> Training GBayes is cheap, and the more you feed it the less need to do
> information-destroying transformations (like folding case or ignoring
> punctuation).

Speaking of which, I had a thought this morning (in the shower of course ;)
about a slightly more intelligent tokeniser.

Split on whitespace, then split any run of punctuation at the end of a
"word" off as a separate word. Punctuation inside a word is left alone.

So:

    a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
    
    A phrase. -> 'A', 'phrase', '.'
    
    WTF??? -> 'WTF', '???'

    >>> import module -> '>>>', 'import', 'module'

Might this be useful? No code of course ;)
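
For illustration only, the rule above could be sketched roughly like this in
Python. It assumes "punctuation" means the ASCII characters in
string.punctuation, and the name tokenize is hypothetical, not anything
that exists in GBayes:

```python
import string


def tokenize(text):
    """Split on whitespace, then split trailing punctuation off each chunk.

    Assumption: "punctuation" is the ASCII set in string.punctuation.
    Punctuation inside a chunk (as in 'a.b.c') is left untouched.
    """
    tokens = []
    for chunk in text.split():
        # Strip the trailing run of punctuation; what is left is the "word".
        word = chunk.rstrip(string.punctuation)
        tail = chunk[len(word):]
        if word:
            tokens.append(word)
        if tail:
            # The whole trailing run becomes one token, e.g. '???'.
            tokens.append(tail)
    return tokens


if __name__ == "__main__":
    for sample in ["a.b.c", "A phrase.", "WTF???", ">>> import module"]:
        print(repr(sample), "->", tokenize(sample))
```

A chunk made up entirely of punctuation (like '>>>') has an empty "word"
part, so it comes out as a single token, which matches the last example.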

Tim Delaney