Scanning a file character by character

Rhodri James rhodri at wildebst.demon.co.uk
Tue Feb 10 23:47:07 CET 2009


On Tue, 10 Feb 2009 22:02:57 -0000, Steven D'Aprano  
<steven at remove.this.cybersource.com.au> wrote:

> On Tue, 10 Feb 2009 12:06:06 +0000, Duncan Booth wrote:
>
>> Steven D'Aprano <steven at REMOVE.THIS.cybersource.com.au> wrote:
>>
>>> On Mon, 09 Feb 2009 19:10:28 -0800, Spacebar265 wrote:
>>>
>>>> How would I separate lines into words without scanning one
>>>> character at a time?
>>>
>>> Scan a line at a time, then split each line into words.
>>>
>>>
>>> for line in open('myfile.txt'):
>>>     words = line.split()
>>>
>>>
>>> should work for a particularly simple-minded idea of words.
>>>
>> Or for a slightly less simple-minded splitting you could try re.split:
>>
>> >>> re.split(r"(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
>> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
>
>
> Perhaps I'm missing something, but the above regex does the exact same
> thing as line.split() except it is significantly slower and harder to
> read.
>
> Neither deals with quoted text, apostrophes, hyphens, punctuation or any
> other details of real-world text. That's what I mean by "simple-minded".

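You're missing something :-)  Specifically, line.split() leaves the
punctuation stuck to the words ('jumps,' and 'over.'), whereas the
regex sweeps the punctuation up with the whitespace into the
separators, and the extended slice [1::2] skips those.  Apostrophes
(and possibly hyphenation) are still a bit moot, though, since \w+
splits on them as well.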
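
For anyone who wants to see the difference side by side, here is a
rough sketch.  The helper names (words_by_whitespace, words_by_regex)
and the last test string are made up for illustration; only
str.split() and re.split() with the "(\w+)" pattern come from the
thread.

```python
import re

SENTENCE = "The quick brown fox jumps, and falls over."


def words_by_whitespace(text):
    # Hypothetical helper: plain str.split(), as in the line.split() loop.
    # Punctuation stays attached to the neighbouring word.
    return text.split()


def words_by_regex(text):
    # Hypothetical helper: because the pattern has a capturing group,
    # re.split() returns separators and \w+ matches alternately, with the
    # separators at even indices.  The [1::2] slice keeps only the matches.
    return re.split(r"(\w+)", text)[1::2]


print(words_by_whitespace(SENTENCE))
# ['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']

print(words_by_regex(SENTENCE))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

# Apostrophes and hyphens are not word characters, so they split too.
print(words_by_regex("Don't re-read it"))
# ['Don', 't', 're', 'read', 'it']
```

The last call shows why apostrophes and hyphens need more thought than
either one-liner gives them.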



-- 
Rhodri James *-* Wildebeeste Herder to the Masses


