[Tutor] Working with files

Erik Price erikprice@mac.com
Fri, 5 Apr 2002 08:24:52 -0500


On Friday, April 5, 2002, at 03:38  AM, Alexandre Ratti wrote:

>> > import re
>> >
>> > for line in inp.readlines():
>> >   if re.search(r'Canada', line): continue # if line contains 'Canada'
>> >   outp.write(line)
>>
>> The only thing with this is that it wouldn't catch a phrase (or a word)
>> that was split between two lines.
>>
>> I can't think of a good solution that doesn't require the entire
>> haystack string to be read into memory.
>
> You could try using 'read(aChunkSize)' instead of 'readlines()'.
>
> Excerpt from the library reference:
>
> read([size])
> Read at most size bytes from the file (less if the read hits EOF before 
> obtaining size bytes). If the size argument is negative or omitted, 
> read all data until EOF is reached. The bytes are returned as a string 
> object. An empty string is returned when EOF is encountered immediately.
>
> So I guess you could read and test chunks two by two. Sounds a bit more 
> complex since you'd need to test a+b, b+c, etc. I suppose there is a 
> standard solution (anyone ? :-)

Ah, I hadn't thought of that.  All along I was wondering why one would 
ever want to read certain lengths rather than something convenient like 
lines or the length of the file.  (I mean, I knew the option to specify 
a length had a place for certain power coders or people with esoteric 
needs, but it didn't seem useful to me immediately.)
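
To make the read(size) behaviour from the library excerpt concrete, here is a minimal sketch of a loop that pulls fixed-size chunks until read() returns an empty string at EOF. The name read_in_chunks and the 4-byte chunk size are made up for illustration, and io.BytesIO stands in for a real file.

```python
import io

def read_in_chunks(f, size):
    """Return a list of successive chunks of at most `size` bytes from f."""
    chunks = []
    while True:
        chunk = f.read(size)
        # read() returns an empty string once EOF has been reached
        if not chunk:
            break
        chunks.append(chunk)
    return chunks

# io.BytesIO is only a stand-in for a file opened in binary mode
chunks = read_in_chunks(io.BytesIO(b"Toronto, Canada"), 4)
```

With this input the word "Canada" ends up split across the last two chunks, which is exactly the overlap problem under discussion.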

But you're right, one still needs to find a way to manage overlaps 
between chunks.  Perhaps something like this sketch:

# a rough sketch; 'filename' is a placeholder for the file to search
# needle = string to search for (bytes, since the file is opened 'rb')
needle = b'Canada'
chunk_size = len(needle)
try:
    f = open('filename', 'rb')
except IOError:
    print('Could not open filename for searching')
else:
    # pointer_multiplier is used to set read position in file;
    # it has to be initialized once, outside the loop
    pointer_multiplier = 1
    while 1:
        # read a section of the file twice the length of needle
        haystack = f.read(chunk_size * 2)
        # an empty string means EOF was hit without finding needle
        if not haystack:
            break
        # if needle is found, report it and stop
        if needle in haystack:
            print('Found %r!' % needle)
            break
        # move internal file pointer (chunk_size * pointer_multiplier)
        # bytes forward from start of file, so that each window
        # overlaps the previous one by chunk_size bytes
        f.seek(chunk_size * pointer_multiplier)
        pointer_multiplier = pointer_multiplier + 1
    f.close()



Well, it's a try at least.  Maybe I've overlooked something.  But it 
lets you read discrete chunks instead of the entire file all at once, 
and as far as I can tell you would still be able to find a string if you 
accounted for newlines/end-of-line characters.
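
A variant of the same idea that avoids moving the file pointer: instead of re-reading, carry the last len(needle) - 1 bytes of each window over to the next read (the "a+b, b+c" test from the quoted message). This is only a sketch under a few assumptions: the file is opened in binary mode, needle is a byte string, and the name find_in_chunks and the default chunk size are invented here. Because it works on raw bytes rather than lines, a needle that spans a newline is matched like any other.

```python
import io

def find_in_chunks(f, needle, chunk_size=4096):
    """Return the offset of the first occurrence of needle in f, or -1."""
    if not needle:
        return 0
    keep = len(needle) - 1   # longest tail that could still begin a match
    tail = b""
    offset = 0               # file offset of the first byte of the window
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return -1        # EOF reached without a match
        window = tail + chunk
        pos = window.find(needle)
        if pos != -1:
            return offset + pos
        # carry over only the bytes that could start a match
        new_tail = window[-keep:] if keep else b""
        offset += len(window) - len(new_tail)
        tail = new_tail

# io.BytesIO stands in for a file opened with mode 'rb'
position = find_in_chunks(io.BytesIO(b"Toronto, Canada"), b"Canada", 4)
```

Memory use stays bounded by roughly chunk_size + len(needle) bytes, whatever the size of the file.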

Of course, if needle is half the length of the file or more, you may as 
well just read the whole file into memory.
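
That cut-off can be checked before deciding how to search. A minimal sketch, assuming a byte-string needle; the helper names window_covers_file and search_whole_file are hypothetical, and the temporary file exists only so the example is self-contained.

```python
import os
import tempfile

def window_covers_file(path, needle):
    """True when one window of 2 * len(needle) bytes spans the whole file."""
    return 2 * len(needle) >= os.path.getsize(path)

def search_whole_file(path, needle):
    """Read the entire file into memory and test for needle."""
    with open(path, "rb") as f:
        return needle in f.read()

# build a small throwaway file for the demonstration
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"O Canada")

covers = window_covers_file(path, b"Canada")   # 12-byte window, 8-byte file
found = search_whole_file(path, b"Canada")
os.remove(path)
```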

What do you think?


Erik