Scanning a file character by character
rhodri at wildebst.demon.co.uk
Tue Feb 10 17:47:07 EST 2009
On Tue, 10 Feb 2009 22:02:57 -0000, Steven D'Aprano
<steven at remove.this.cybersource.com.au> wrote:
> On Tue, 10 Feb 2009 12:06:06 +0000, Duncan Booth wrote:
>> Steven D'Aprano <steven at REMOVE.THIS.cybersource.com.au> wrote:
>>> On Mon, 09 Feb 2009 19:10:28 -0800, Spacebar265 wrote:
>>>> How would I separate lines into words without scanning one
>>>> character at a time?
>>> Scan a line at a time, then split each line into words.
>>> for line in open('myfile.txt'):
>>>     words = line.split()
>>> should work for a particularly simple-minded idea of words.
>> Or for a slightly less simple minded splitting you could try re.split:
>> >>> re.split("(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
>> ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
> Perhaps I'm missing something, but the above regex does the exact same
> thing as line.split() except it is significantly slower and harder to
> read.
>
> Neither deals with quoted text, apostrophes, hyphens, punctuation or any
> other details of real-world text. That's what I mean by "simple-minded".
You're missing something :-) Specifically, the punctuation gets swept
up with the whitespace, and the extended slice skips it. Apostrophes
(and possibly hyphenation) are still a bit moot, though.
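
For the record, here is a quick sketch of the difference, using the same
sample sentence plus a made-up one with an apostrophe in it. The variable
names are mine, purely for illustration:

```python
import re

text = "The quick brown fox jumps, and falls over."

# Plain whitespace splitting leaves punctuation stuck to the words.
by_whitespace = text.split()

# Splitting on a captured run of word characters yields a list that
# alternates separators (even indices) and words (odd indices), so the
# extended slice [1::2] keeps only the words.
by_regex = re.split(r"(\w+)", text)[1::2]

print(by_whitespace)  # 'jumps,' and 'over.' keep their punctuation
print(by_regex)       # 'jumps' and 'over' come out clean

# The apostrophe problem: \w does not match "'", so the word is cut in two.
with_apostrophe = re.split(r"(\w+)", "Don't stop")[1::2]
print(with_apostrophe)
```

So the regex version is cleaner about commas and full stops, but it chops
"Don't" into "Don" and "t", which is the moot part.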
Rhodri James *-* Wildebeeste Herder to the Masses