Scanning a file character by character
Tim Chase
python.list at tim.thechases.com
Tue Feb 10 17:46:30 EST 2009
>> Or for a slightly less simple minded splitting you could try re.split:
>>
>>>>> re.split("(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
>> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
>
>
> Perhaps I'm missing something, but the above regex does the exact same
> thing as line.split() except it is significantly slower and harder to
> read.
>
> Neither deal with quoted text, apostrophes, hyphens, punctuation or any
> other details of real-world text. That's what I mean by "simple-minded".
>>> s = "The quick brown fox jumps, and falls over."
>>> import re
>>> re.split(r"(\w+)", s)[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
>>> s.split()
['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']
Note the difference in "jumps" vs. "jumps," (extra comma in the
string.split() version) and likewise the period after "over".
Thus not quite "the exact same thing as line.split()".
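For anyone puzzled by the [1::2] slice: as far as I can tell it works because re.split() with a capturing group keeps the captured text in the result, so the list alternates gap, word, gap, word. A rough sketch (the names parts, gaps and words are mine, purely for illustration):

```python
import re

s = "The quick brown fox jumps, and falls over."

# With a capturing group in the pattern, re.split() returns both the
# text between matches and the matched text itself, alternating.
parts = re.split(r"(\w+)", s)

gaps = parts[0::2]   # what sits between the words: spaces, punctuation
words = parts[1::2]  # the captured \w+ matches

print(parts[:4])     # leading empty string, because s starts with a word
print(words)
print(gaps)
```

Since nothing is thrown away, "".join(parts) gives back the original string, which is handy if you need the separators later.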
I think an easier-to-read variant would be
>>> re.findall(r"\w+", s)
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
which just finds words. One could also just limit it to letters with
re.findall(r"[a-zA-Z]+", s)
as "\w" is a little more encompassing (letters, digits, and underscores)
if that's a problem.
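To see where the two patterns part ways, here is a quick comparison on a made-up sample string (the string and variable names are just for illustration). Note that the letters-only class needs a trailing "+" to match whole runs; without it you get one letter per match. Neither pattern copes with apostrophes.

```python
import re

# Hypothetical sample containing an apostrophe, an underscore and digits.
sample = "Don't split file_2 or x86 badly."

word_chars = re.findall(r"\w+", sample)          # letters, digits, underscore
letters_only = re.findall(r"[a-zA-Z]+", sample)  # runs of ASCII letters
single_letters = re.findall(r"[a-zA-Z]", sample) # no "+": one letter each

print(word_chars)
print(letters_only)
print(single_letters[:4])
```

Here "file_2" and "x86" survive intact under \w+ but get cut down to "file" and "x" by the letters-only class, and "Don't" is split at the apostrophe either way.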
-tkc