At 06:05 AM 3/19/2003 -0800, Abdirizak abdi wrote:
>buf = re.compile("[a-zA-Z]+\s+")
>this was to match the following string:
>str = 'Data sparseness is an inherent problem in statistical methods for 
>natural language processing.'
>Result: ['Data', 'sparseness', 'is', 'an', 'inherent', 'problem', 'in', 
>'statistical', 'methods', 'for', 'natural', 'language']
>the result is that it gets all the tokens except the last one, 
>'processing' plus the dot (the full stop at the end)

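The problem is that \s+ requires whitespace after each word. There is no 
whitespace after 'processing'; it is followed by the full stop. Also, you 
should put the pattern in a raw string, otherwise Python will interpret some 
backslash sequences as special characters before the re module ever sees 
them (\b, for example, becomes a backspace).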
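
Here is a small sketch that reproduces the behaviour, in case it helps to 
see it run. The variable name text is my own choice (the original used str, 
which shadows the built-in type). Note that findall returns the whole match, 
so each item carries its trailing whitespace.

```python
import re

text = ("Data sparseness is an inherent problem in statistical methods "
        "for natural language processing.")

# Original pattern, written as a raw string: each word must be
# followed by at least one whitespace character.
pattern = re.compile(r"[a-zA-Z]+\s+")
matches = pattern.findall(text)

# 'processing' is followed by '.', not whitespace, so it is never matched.
print(matches[-1])        # the last match is 'language ' (with the space)

# Why raw strings matter: in a normal string literal "\b" is a single
# backspace character, while r"\b" is a backslash followed by 'b'.
print(len("\b"), len(r"\b"))
```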

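One solution is to specify whitespace OR end of string: buf = 
re.compile(r"[a-zA-Z]+(?:\s+|$)"). \s+|$ says whitespace OR end of string. 
I put that in () due to the low precedence of |, and added ?: to make it "a 
non-grouping version of regular parentheses." Note that this only picks up 
the last word when the string ends right after it. In your example 
'processing' is followed by a full stop, so you would also have to allow for 
trailing punctuation (for instance \W*$ in place of $).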
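
A quick sketch of how the whitespace-or-end pattern behaves, assuming 
shortened sample strings of my own. The second pattern (the name variant, 
the capturing group, and \W*$) is just one possible adjustment for a 
trailing full stop, not the only way to do it. It still skips a word that is 
followed by punctuation in the middle of the string.

```python
import re

pattern = re.compile(r"[a-zA-Z]+(?:\s+|$)")

# String ends right after the last word: every word is found
# (matches still include the trailing whitespace).
print(pattern.findall("natural language processing"))

# String ends with a full stop: the last word is still missed,
# because '.' is neither whitespace nor end of string.
print(pattern.findall("natural language processing."))

# Hypothetical variant: capture only the letters, and allow any
# non-word characters before the end of the string.
variant = re.compile(r"([a-zA-Z]+)(?:\s+|\W*$)")
print(variant.findall("natural language processing."))
```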

A completely different approach is to use \b to match start or end of word: 
buf = re.compile(r"\b[a-zA-Z]+\b").
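
For illustration, the \b pattern applied to the original sentence (again 
using text as my own variable name). Because \b matches at a position 
without consuming any characters, the matches contain only the letters, and 
the full stop after 'processing' does not get in the way.

```python
import re

text = ("Data sparseness is an inherent problem in statistical methods "
        "for natural language processing.")

# \b matches at a word boundary without consuming any characters.
buf = re.compile(r"\b[a-zA-Z]+\b")
words = buf.findall(text)

print(len(words))     # 13 words
print(words[-1])      # 'processing', without the full stop
```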

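If you just want to create a list of whitespace-separated words, use 
str.split(). Note that punctuation stays attached to the words, so the last 
item will be 'processing.' with the full stop.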
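
A minimal sketch of the split approach. The strip call and its set of 
punctuation characters are my own addition, shown only as one simple way to 
drop the trailing full stop.

```python
text = ("Data sparseness is an inherent problem in statistical methods "
        "for natural language processing.")

# split() with no arguments splits on any run of whitespace.
tokens = text.split()
print(tokens[-1])     # 'processing.' (the full stop is kept)

# Optional: strip common punctuation from both ends of each token.
cleaned = [t.strip(".,;:!?") for t in tokens]
print(cleaned[-1])    # 'processing'
```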

