best split tokens?

Fri Sep 8 17:46:12 EDT 2006

> py> import re
> py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> py> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers', 
> 'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did', 
> 'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to', 
> 'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily', 
> 'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the', 
> 'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for', 
> 'all', 'William', 'Shakespear']

This regexp could be shortened to just

	rgx = re.compile(r'\W+')

if you don't mind numbers being included in your text (in the event 
you have things like "fatal1ty", "thing2", or "pdf2txt"), which is 
often the case...they should be considered part of the word.

If that's a problem, you should be able to use

	rgx = re.compile('[^a-zA-Z]+')
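
To make the difference concrete, here is a small sketch comparing the two patterns. The sample string is made up for illustration, and it assumes a current Python 3 rather than the 2.4 I have here:

```python
import re

# Hypothetical sample text, not from the original thread.
text = "fatal1ty ran pdf2txt on thing2, twice."

def tokens(pattern, s):
    """Split s on the given pattern and drop empty strings."""
    return [t for t in re.split(pattern, s) if t]

# \W+ splits on runs of non-word characters, so digits
# (and underscores) stay inside the tokens.
with_digits = tokens(r'\W+', text)

# [^a-zA-Z]+ splits on anything that is not an ASCII letter,
# so a digit in the middle of a word breaks it in two.
letters_only = tokens(r'[^a-zA-Z]+', text)

print(with_digits)
print(letters_only)
```

With the second pattern "fatal1ty" comes back as "fatal" and "ty", so which one you want depends on whether such names count as single words.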

This is a bit English-centric, since accented letters get treated as 
separators...ideally Python regexps would support 
POSIX character classes, so one could use

	rgx = re.compile('[^[:alpha:]]+')

or something like that...however, that fails on my Python 2.4 here.
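
One possible workaround, assuming Python 3 where str patterns are Unicode-aware by default: since \w covers letters, digits, and the underscore, the class [\W\d_] matches roughly everything that is not a letter. This is only a sketch of a substitute, not POSIX class support, and the sample string is invented:

```python
import re

# Runs of characters that are non-word, digits, or underscores,
# i.e. (approximately) anything that is not a letter.
rgx = re.compile(r'[\W\d_]+')

# Hypothetical sample; escapes are used for the accented letters
# so the example does not depend on the file encoding.
text = "caf\u00e92go, na\u00efve_user se\u00f1or"

tokens = [s for s in rgx.split(text) if s]
print(tokens)
```

The accented letters stay inside their words, while the digit, the comma, and the underscore all act as separators. A few unusual numeric characters that \d does not cover may still slip through as "letters".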

-tkc