[Tutor] Working with files

Erik Price erikprice@mac.com
Fri, 5 Apr 2002 08:24:52 -0500

On Friday, April 5, 2002, at 03:38  AM, Alexandre Ratti wrote:

>> > import re
>> >
>> > for line in inp.readlines():
>> >   if re.search(r'Canada', line): continue # if line contains 'Canada'
>> >   outp.write(line)
>> The only thing with this is that it wouldn't catch a phrase (or a word)
>> that was split between two lines.
>> I can't think of a good solution that doesn't require the entire
>> haystack string to be read into memory.
> You could try using 'read(aChunkSize)' instead of 'readlines()'.
> Excerpt from the library reference:
> read([size])
> Read at most size bytes from the file (less if the read hits EOF before 
> obtaining size bytes). If the size argument is negative or omitted, 
> read all data until EOF is reached. The bytes are returned as a string 
> object. An empty string is returned when EOF is encountered immediately.
> So I guess you could read and test chunks two by two. Sounds a bit more 
> complex since you'd need to test a+b, b+c, etc. I suppose there is a 
> standard solution (anyone ? :-)

Ah, I hadn't thought of that.  All along I was wondering why one would 
ever want to read certain lengths rather than something convenient like 
lines or the length of the file.  (I mean, I knew the option to specify 
a length had a place for certain power coders or people with esoteric 
needs, but it didn't seem useful to me immediately.)

But you're right, one still needs a way to handle matches that straddle 
the boundary between chunks.  Perhaps something like this:

# rough sketch; 'filename' is a placeholder path
# needle = string to search for (bytes, since the file is opened in
# binary mode so that seek() offsets are plain byte counts)
import re

needle = b'Canada'
chunk_size = len(needle)
# pointer_multiplier is used to set read position in file
pointer_multiplier = 0
try:
    f = open('filename', 'rb')
except IOError:
    print('Could not open filename for searching')
else:
    while True:
        # move internal file pointer (chunk_size * pointer_multiplier)
        # bytes forward from start of file
        f.seek(chunk_size * pointer_multiplier)
        # read a section of the file twice the length of needle
        haystack = f.read(chunk_size * 2)
        if not haystack:
            break  # reached end of file without finding needle
        # if needle is found, report it and stop
        # (re.escape so needle is matched literally, not as a regex)
        if re.search(re.escape(needle), haystack):
            print('Found %r!' % needle)
            break
        pointer_multiplier = pointer_multiplier + 1
    f.close()

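Well, it's a try at least.  Maybe I've overlooked something.  But it 
lets you read discrete chunks instead of the entire file all at once.  
Each read is twice the length of needle and starts only one 
needle-length after the previous one, so any occurrence of needle falls 
entirely inside some read.  As far as I can tell you would still be able 
to find a phrase split across lines, as long as you accounted for the 
newline/end-of-line characters.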
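
One way to account for the end-of-line characters might be to build a 
regular expression from the phrase in which the gap between words 
matches any run of whitespace, newlines included.  This is only a sketch: 
phrase_pattern is a name I made up, 'Hudson Bay' is just sample data, and 
it only helps if the text being searched contains both sides of the line 
break.

```python
import re

def phrase_pattern(phrase):
    """Compile a regex that matches phrase with any whitespace between words."""
    words = phrase.split()
    # re.escape keeps each word literal; \s+ matches spaces, tabs, \n, \r\n
    return re.compile(r'\s+'.join(re.escape(word) for word in words))

# sample text where the phrase is split across two lines
text = 'He sailed into Hudson\nBay in the spring.'
pattern = phrase_pattern('Hudson Bay')
if pattern.search(text):
    print('Found the phrase even though it is split across lines')
```

A plain substring test for 'Hudson Bay' would miss this text, because 
the two words are separated by a newline rather than a space.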

Of course, if needle is half the length of the file or more, you may as 
well just read the whole file into memory.
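
Short of that, the "a+b, b+c" idea should also work with bigger chunks 
and no seeking back and forth.  Here is a rough sketch, not a standard 
recipe: the name found_in_chunks and the default chunk_size of 4096 are 
my own choices, and it assumes a file object opened in binary mode and a 
needle given as bytes (Python 3 style).

```python
import io

def found_in_chunks(f, needle, chunk_size=4096):
    """Return True if needle (bytes) occurs in the binary file object f."""
    if chunk_size < max(1, len(needle)):
        raise ValueError('chunk_size must be at least len(needle)')
    previous = b''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            # read() returns an empty bytes object at EOF
            return False
        # search previous + current, so a match that straddles the
        # boundary between two reads is still seen in one piece
        window = previous + chunk
        if needle in window:
            return True
        # keep at most chunk_size trailing bytes for the next round
        previous = window[-chunk_size:]

# usage with an in-memory file; 'Canada' is split across two 8-byte reads
data = b'x' * 5 + b'Canada' + b'y' * 10
print(found_in_chunks(io.BytesIO(data), b'Canada', chunk_size=8))
```

Only about two chunks are held in memory at any time, so the chunk size 
can be tuned to whatever is comfortable.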

What do you think?
