Scanning a file character by character

Tue Feb 10 17:02:57 EST 2009

On Tue, 10 Feb 2009 12:06:06 +0000, Duncan Booth wrote:

> Steven D'Aprano <steven at REMOVE.THIS.cybersource.com.au> wrote:
> 
>> On Mon, 09 Feb 2009 19:10:28 -0800, Spacebar265 wrote:
>> 
>>> How would I separate lines into words without scanning one
>>> character at a time?
>> 
>> Scan a line at a time, then split each line into words.
>> 
>> 
>> for line in open('myfile.txt'):
>>     words = line.split()
>> 
>> 
>> should work for a particularly simple-minded idea of words.
>> 
> Or for a slightly less simple minded splitting you could try re.split:
> 
>>>> re.split(r"(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

Perhaps I'm missing something, but the above regex does the exact same 
thing as line.split() except it is significantly slower and harder to 
read.

Neither deals with quoted text, apostrophes, hyphens, punctuation or any
other details of real-world text. That's what I mean by "simple-minded".
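
To make the comparison concrete, here is a minimal sketch that runs the
approaches from this thread side by side. The second sample sentence and
the final pattern are illustrative assumptions, not something proposed
in the posts above, and the pattern is still nowhere near a real
tokenizer.

```python
import re

text = "The quick brown fox jumps, and falls over."

# Whitespace splitting: punctuation stays attached to the adjacent word.
by_whitespace = text.split()
# ['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']

# The re.split idiom quoted above: the capturing group puts the matched
# words at the odd indices of the result, so [1::2] selects them.
by_re_split = re.split(r"(\w+)", text)[1::2]
# ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

# re.findall gives the same list as the idiom above, without the slicing.
by_findall = re.findall(r"\w+", text)

# Hypothetical, slightly more forgiving pattern: allow apostrophes and
# hyphens inside a word. Quoted text and much else is still unhandled.
sample = "Don't over-think it, said O'Brien."
by_pattern = re.findall(r"\w+(?:['-]\w+)*", sample)
# ["Don't", 'over-think', 'it', 'said', "O'Brien"]
```

On this sample the visible difference is that split() leaves "jumps,"
and "over." with their punctuation attached, while the \w+ versions
drop it. Whether that matters depends on what you plan to do with the
words afterwards.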

-- 
Steven