best split tokens?
Tim Chase
python.list at tim.thechases.com
Fri Sep 8 17:46:12 EDT 2006
> py> import re
> py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> py> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
> 'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
> 'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
> 'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
> 'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
> 'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
> 'all', 'William', 'Shakespear']
This regexp could be shortened to just
rgx = re.compile(r'\W+')
if you don't mind numbers being included in your text (in the event
you have things like "fatal1ty", "thing2", or "pdf2txt"), which is
often the case...they should be considered part of the word.
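As a rough illustration of what \W+ keeps, here is a small sketch; the sample string is made up for this example and is not from the thread. Note that \w also covers the underscore, so names like snake_case stay in one piece alongside the digits.

```python
import re

# Hypothetical sample text, not from the original thread.
sample = "Use pdf2txt, thing2 (or fatal1ty); snake_case too."

# Split on runs of non-word characters; digits and underscores
# count as word characters, so they stay inside the tokens.
rgx = re.compile(r'\W+')

# Drop the empty strings that split() produces at the edges.
tokens = [s for s in rgx.split(sample) if s]
print(tokens)
# ['Use', 'pdf2txt', 'thing2', 'or', 'fatal1ty', 'snake_case', 'too']
```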
If that's a problem, you should be able to use
rgx = re.compile('[^a-zA-Z]+')
This is a bit Euro-centric...ideally Python regexps would support
POSIX character classes, so one could use
rgx = re.compile('[^[:alpha:]]+')
or something like it...however, that fails on my python2.4 here.
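One possible workaround, sketched here under the assumption of Python 3 (where str patterns are Unicode-aware by default, which is not the python2.4 setup above): take everything that is not a word character, then add digits and the underscore to the separator set. That leaves roughly "runs of letters" without POSIX classes. The sample string is hypothetical, and the match is only approximate: some numeric characters outside \d and decomposed accents may not behave as expected.

```python
import re

# Assumption: Python 3, so \W and \d are Unicode-aware for str patterns.
# Separators are non-word characters plus digits and underscores,
# which leaves (approximately) runs of letters as the tokens.
rgx = re.compile(r'[\W\d_]+')

# Hypothetical sample text, not from the original thread.
sample = "naïve café, thing2 señor_Ångström"

# Drop the empty strings that split() can produce at the edges.
tokens = [s for s in rgx.split(sample) if s]
print(tokens)
```

With precomposed accented characters, the accented words should stay whole, while "thing2" loses its trailing digit, which is the same trade-off as with [^a-zA-Z]+.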
-tkc