[Tutor] Working with files
Erik Price
erikprice@mac.com
Fri, 5 Apr 2002 08:24:52 -0500
On Friday, April 5, 2002, at 03:38 AM, Alexandre Ratti wrote:
>> > import re
>> >
>> > for line in inp.readlines():
>> >     if re.search(r'Canada', line): continue # skip lines containing 'Canada'
>> >     outp.write(line)
>>
>> The only thing with this is that it wouldn't catch a phrase (or a word)
>> that was split between two lines.
>>
>> I can't think of a good solution that doesn't require the entire
>> haystack string to be read into memory.
>
> You could try using 'read(aChunkSize)' instead of 'readlines()'.
>
> Excerpt from the library reference:
>
> read([size])
> Read at most size bytes from the file (less if the read hits EOF before
> obtaining size bytes). If the size argument is negative or omitted,
> read all data until EOF is reached. The bytes are returned as a string
> object. An empty string is returned when EOF is encountered immediately.
>
> So I guess you could read and test chunks two by two. Sounds a bit more
> complex since you'd need to test a+b, b+c, etc. I suppose there is a
> standard solution (anyone ? :-)
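For reference, the "test a+b, b+c" idea quoted above might be sketched roughly as follows. The function name contains_pairwise and the default chunk_size are made up for illustration, and the sketch assumes a regular file opened in binary mode, so that read() only returns a short chunk at EOF.

```python
import io


def contains_pairwise(fileobj, needle, chunk_size=4096):
    """Return True if needle (bytes) occurs anywhere in fileobj.

    Reads fixed-size chunks and tests each adjacent pair joined
    together (a+b, then b+c, ...), so a match that straddles a chunk
    boundary is still seen.
    """
    if not needle:
        return True
    # A match can straddle at most two chunks only if each chunk is at
    # least as long as the needle, so enforce that here.
    chunk_size = max(chunk_size, len(needle))
    previous = b""
    while True:
        current = fileobj.read(chunk_size)
        if not current:
            # EOF reached without a match
            return False
        if needle in previous + current:
            return True
        previous = current


# Usage: 'Canada' is split across the first two 11-byte chunks.
data = io.BytesIO(b"made in Can" + b"ada, shipped")
found = contains_pairwise(data, b"Canada", chunk_size=11)
```

Only two chunks are held in memory at any time, which is the point of the exercise.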
Ah, I hadn't thought of that. All along I was wondering why one would
ever want to read certain lengths rather than something convenient like
lines or the length of the file. (I mean, I knew the option to specify
a length had a place for certain power coders or people with esoteric
needs, but it didn't seem useful to me immediately.)
But you're right, one still needs to find a way to manage the overlap
between chunks. Perhaps something like this (a rough sketch):
# rough sketch; needle = string to search for
# (bytes, because the file is opened in binary mode so that
# seek() offsets are plain byte counts)
import re

needle = b'Canada'
filename = 'filename'
chunk_size = len(needle)
try:
    f = open(filename, 'rb')
except IOError:
    print('Could not open %s for searching' % filename)
else:
    # pointer_multiplier is used to set read position in file;
    # it has to be set outside the loop or it never advances
    pointer_multiplier = 1
    while 1:
        # read a section of the file twice the length of needle
        haystack = f.read(chunk_size * 2)
        # an empty read means EOF: needle is not in the file
        if not haystack:
            print('Did not find %r' % needle)
            break
        # if needle is found, report it and stop
        if re.search(re.escape(needle), haystack):
            print('Found %r!' % needle)
            break
        # move internal file pointer (chunk_size * pointer_multiplier)
        # bytes forward from start of file, so the next section
        # overlaps this one by chunk_size bytes
        f.seek(chunk_size * pointer_multiplier)
        pointer_multiplier = pointer_multiplier + 1
    f.close()
Well, it's a try at least. Maybe I've overlooked something. But it
lets you read discrete chunks instead of the entire file all at once,
and as far as I can tell you would still be able to find a string if you
accounted for newlines/end-of-line characters.
Of course, if needle is half the length of the file or more, you may as
well just read the whole file into memory.
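A variation on the same idea avoids moving the file pointer at all: keep only the last len(needle) - 1 bytes of what has already been searched and prepend them to the next chunk. This is only a sketch under assumptions; the name find_in_stream, the default chunk_size, and the choice to return a byte offset (or -1) are all made up for illustration, and the file is assumed to be opened in binary mode.

```python
import io


def find_in_stream(fileobj, needle, chunk_size=4096):
    """Return the byte offset of the first match of needle, or -1."""
    if not needle:
        return 0
    overlap = len(needle) - 1
    tail = b""     # last `overlap` bytes of the data already searched
    consumed = 0   # number of bytes read before the current chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            # EOF reached without a match
            return -1
        window = tail + chunk
        index = window.find(needle)
        if index != -1:
            # the window starts len(tail) bytes before this chunk
            return consumed - len(tail) + index
        consumed += len(chunk)
        # carry over just enough bytes to catch a match that
        # straddles the boundary with the next chunk
        tail = window[-overlap:] if overlap else b""


# Usage: the needle starts at byte 10 and spans two 4-byte chunks.
stream = io.BytesIO(b"x" * 10 + b"Canada" + b"y" * 5)
offset = find_in_stream(stream, b"Canada", chunk_size=4)
```

Because the carried-over tail is shorter than needle, a match can never be found twice, and memory use stays at roughly chunk_size + len(needle) bytes regardless of file size.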
What do you think?
Erik