High memory usage - program mistake or Python feature?

Mon May 26 15:28:29 EDT 2003

The clean (OO) solution is to build a hierarchy of factory classes.

class AbstractLineIteratorFactory:

  def __init__( self, file ):
   self.file = file

  def getIterator( self ):
   raise NotImplementedError

# This is not exactly a factory
class InMemoryIteratorFactory( AbstractLineIteratorFactory ):

  def __init__( self, file ):
   AbstractLineIteratorFactory.__init__( self, file )
   self.lines = None

  def getIterator( self ):
   """I don't answer a real iterator"""
   if self.lines is None:
    self.lines = self.file.readlines()
   return self.lines

class RealIteratorFactory( AbstractLineIteratorFactory ):

  def getIterator( self ):
   return self.file.xreadlines()

This solution has one drawback:
Everywhere you pass the file object you have to add a second parameter
for the IteratorFactory, and you have to replace every call to
xreadlines with a call to iteratorFactory.getIterator.
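
To make the call-site change concrete, here is a minimal sketch. It is
written for current Python 3, where xreadlines no longer exists and
iterating the file object itself takes its place. The names LazyLines,
CachedLines and count_matching are made up for illustration, and
io.StringIO stands in for a real file.

```python
import io

class LazyLines:
    """Hypothetical stand-in for RealIteratorFactory (Python 3)."""

    def __init__(self, file):
        self.file = file

    def getIterator(self):
        # Rewind so every call starts at the first line, then let the
        # file object iterate over itself (the successor to xreadlines).
        self.file.seek(0)
        return iter(self.file)

class CachedLines:
    """Hypothetical stand-in for InMemoryIteratorFactory."""

    def __init__(self, file):
        self.file = file
        self.lines = None

    def getIterator(self):
        # Read the whole file once, then keep handing out the same list.
        if self.lines is None:
            self.lines = self.file.readlines()
        return self.lines

def count_matching(factory, word):
    # The call site asks the factory for the lines instead of
    # calling xreadlines on the file directly.
    return sum(1 for line in factory.getIterator() if word in line)

text = "spam\neggs\nspam and eggs\n"
lazy = LazyLines(io.StringIO(text))
cached = CachedLines(io.StringIO(text))
```

Calling count_matching twice with the same factory should give the same
answer both times, whichever of the two factories is passed in.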

You can add a __getattr__ method to AbstractLineIteratorFactory
as follows:

def __getattr__( self, name ):
  return getattr( self.file, name )

This delegates every attribute access the factory cannot satisfy itself
(everything except file and getIterator), and in turn every such method
call, to your file object, so the factory can be passed around in place
of the file.
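
A small sketch of that delegation, again assuming Python 3 and using
io.StringIO in place of a real file. The class name DelegatingFactory is
invented for this example.

```python
import io

class DelegatingFactory:
    """Hypothetical name; shows only the __getattr__ delegation."""

    def __init__(self, file):
        self.file = file

    def getIterator(self):
        self.file.seek(0)
        return iter(self.file)

    def __getattr__(self, name):
        # Called only when normal lookup fails, so self.file and
        # getIterator are found first and never end up here.
        return getattr(self.file, name)

wrapped = DelegatingFactory(io.StringIO("one\ntwo\n"))
first = wrapped.readline()            # delegated to the StringIO
rest = wrapped.read()                 # delegated as well
lines = list(wrapped.getIterator())   # defined on the wrapper itself
```

One caveat: in Python 3 implicit special-method lookups (such as the
__iter__ used by a for loop) bypass __getattr__, so looping over the
wrapper directly would still need an explicit __iter__ method.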

So much for OO-theory, now the pythonic solution
(This needs at least Python 2.2):

Derive a (new style) class from the builtin file type.

class OptionalLineCachingFile( file ):

  def cacheOn( self ):
   """Activate line caching"""
   self.lines = self.readlines()
   self.xreadlines = self.getLines

  def cacheOff( self ):
   """Deactivate the cache"""
   del self.lines
   del self.xreadlines

  def getLines( self ):
   return self.lines

Then replace every call to open or file with a call to
OptionalLineCachingFile.

Please note:
Neither solution answers a real iterator in the caching case; both hand back the list itself.
Turning them into iterators is left as an exercise to the gentle reader ;-)
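
One possible answer to that exercise, as an untested-in-anger sketch for
Python 3: give the wrapper an __iter__ method that returns a fresh
iterator each time, either over the cached list or over the rewound
file. The class name RewindableLines and its cache parameter are
assumptions made for this example; cacheOn and cacheOff mirror the
method names used above.

```python
import io

class RewindableLines:
    """Hypothetical wrapper: every for loop gets a fresh iterator."""

    def __init__(self, file, cache=False):
        self.file = file
        self.lines = None
        if cache:
            self.cacheOn()

    def cacheOn(self):
        """Read all lines into memory."""
        self.file.seek(0)
        self.lines = self.file.readlines()

    def cacheOff(self):
        """Drop the in-memory copy."""
        self.lines = None

    def __iter__(self):
        if self.lines is not None:
            return iter(self.lines)   # real iterator over the cached list
        self.file.seek(0)             # rewind the on-disk version
        return iter(self.file)

allLines = RewindableLines(io.StringIO("a\nb\nc\n"))
first_pass = list(allLines)
second_pass = list(allLines)   # works again, no manual reset needed
allLines.cacheOn()
cached_pass = list(allLines)
```

Code such as GetLinesThatMatch(allLines) can then loop over allLines as
often as it likes. Note that in the uncached case two loops running at
the same time would share one file position, so this only suits
one-after-another iteration.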

HTH,
Gerald

PS: None of the code above has been syntax checked or even tested :-]

Ben S wrote:
> Gerald Klix wrote:
> 
>>Simply call xreadlines again, that gives a new iterator.
> 
> 
> How would I do it transparently, though? I mean, I want to have several
> bits of code such as the following:
> 
> selectedLines = GetLinesThatMatch(allLines)
> 
> And I don't want to change them all if I need to switch between an
> in-memory copy of the file (for speed) and the on-disk version from
> xreadlines() (for memory conservation). Just using xreadlines again
> means I have to change all these lines of code, as I can't just say
> allLines = file.xreadlines() as it won't reset it each time.
> 
> I know this may sound like a trivial problem but I am trying to learn
> how I can use Python to isolate myself from such changes.
> 
> --
> Ben Sizer
> http://pages.eidosnet.co.uk/kylotan
> 
> 
> 
>>Ben S wrote:
>>
>>>Hmm, in quick experiments with using xreadlines instead of readlines,
>>>there is obviously the problem that while a single iteration over
>>>either container works the same way, in order to repeat iterations
>>>over xreadlines I need to somehow reset the iterator, which a quick
>>>look at the documentation doesn't show me how to do. How do I do
>>>this, so that my functions can take a list of lines without caring
>>>whether those lines are in memory or coming from xreadlines?
>>>
>>>--
>>>Ben Sizer
>>>http://pages.eidosnet.co.uk/kylotan
>>
> 
>