Scanning a file character by character

Tue Feb 10 17:46:30 EST 2009

>> Or for a slightly less simple minded splitting you could try re.split:
>>
>>>>> re.split("(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
>> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
> 
> 
> Perhaps I'm missing something, but the above regex does the exact same 
> thing as line.split() except it is significantly slower and harder to 
> read.
> 
> Neither deal with quoted text, apostrophes, hyphens, punctuation or any 
> other details of real-world text. That's what I mean by "simple-minded".

   >>> s = "The quick brown fox jumps, and falls over."
   >>> import re
   >>> re.split(r"(\w+)", s)[1::2]
   ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
   >>> s.split()
   ['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']

Note the difference in "jumps" vs. "jumps," (s.split() keeps the 
comma) and likewise the period after "over". 
Thus not quite "the exact same thing as line.split()".
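
In case the [1::2] slice looks like magic: with a capturing group in 
the pattern, re.split() returns the captured matches interleaved with 
the text between them, so the words sit in the odd slots. A small 
sketch (the names "pieces", "between" and "words" are mine, purely for 
illustration):

```python
import re

s = "The quick brown fox jumps, and falls over."

# With a capturing group, re.split() keeps the matched words in the
# result, alternating with the text found between them.
pieces = re.split(r"(\w+)", s)

between = pieces[0::2]   # even slots: text between the words (may be empty)
words = pieces[1::2]     # odd slots: the captured words

print(pieces[:5])        # ['', 'The', ' ', 'quick', ' ']
print(words)
```

Nothing is thrown away, so "".join(pieces) gives back the original 
string; the punctuation simply ends up in the even slots.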

I think an easier-to-read variant would be

   >>> re.findall(r"\w+", s)
   ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

which just finds words.  One could also just limit it to letters with

   re.findall(r"[a-zA-Z]+", s)

as "\w" is a little more encompassing (letters, digits and 
underscores) if that's a problem.
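
To see where the two patterns part ways, here is a quick comparison on 
a made-up string (the sample text and variable names are my own, not 
from the thread):

```python
import re

# Hypothetical sample containing an underscore, digits and an apostrophe.
t = "foo_bar baz42 don't"

word_chars = re.findall(r"\w+", t)         # letters, digits, underscores
letters_only = re.findall(r"[a-zA-Z]+", t) # ASCII letters only

print(word_chars)    # ['foo_bar', 'baz42', 'don', 't']
print(letters_only)  # ['foo', 'bar', 'baz', 'don', 't']
```

Neither one copes with the apostrophe, which is the "simple-minded" 
part mentioned above.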

-tkc