processing a Very Large file

DJTB usenet at terabytemusic.cjb.net
Wed May 18 08:38:15 EDT 2005


Robert Brewer wrote:

> DJTB wrote:
>> I'm trying to manually parse a dataset stored in a file. The
>> data should be converted into Python objects.
>> 
> 
> The first question I would ask is: what are you doing with "result", and
> can the consumption of "result" be done iteratively?
> 
> 

The processed data in 'result' is input for a real-time simulator. The
simulator picks the data it needs from 'result', and that lookup has to be
fast: every item in 'result' should always be accessible in O(1), as fast
as possible. That's why everything should be kept in RAM.

In other words: the data in 'result' is not being 'consumed' sequentially
and the data is not thrown away after 'consumption' by the simulator
process. Furthermore, after loading the file, 'result' is read-only.

By the way, 'result' is now a dict:


1. Read raw data from file
2. Process data (convert to Python Object, do some preprocessing)
3. Add mapping hash(path_tuple) --> Object to 'result' dictionary
4. Start simulation process: while True:
        a. given a hash, retrieve the Object from the dictionary 
        b. use the Object for further calculations
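
Roughly, the whole pipeline then looks like this (a minimal sketch; the
file name and the line layout are borrowed from the snippet quoted further
down, and lookup() is just a hypothetical helper for step 4):

from sets import Set

file_name = 'pathtable_ht.dat'   # placeholder, same name as in the snippet below
result = {}                      # maps hash(path_tuple) --> (path_tuple, conflicts)

for line in open(file_name):                 # step 1: read the raw data line by line
    splitres = line.split()
    tuple_size = int(splitres[0]) + 1
    path_tuple = tuple(splitres[1:tuple_size])
    conflicts = Set(map(int, splitres[tuple_size:-1]))
    # step 2: ... do some preprocessing here ...
    result[hash(path_tuple)] = (path_tuple, conflicts)   # step 3

def lookup(h):
    # step 4a: given a hash, retrieve the Object from the dictionary in O(1)
    return result[h]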



Jeffrey Maitland wrote:

>
> Well, one thing I would do is use a for loop over the file itself.
> Right now you are reading the file into a list, and the more you add to
> the list, the more memory it will take (as you know). What this does
> instead is read line by line rather than opening the entire file and then
> reading it. I'm not sure if the problem is due to the fact that you are
> creating sets in addition to the list. Something else that might help
> (not 100% sure that it will) is using del on the sets after they are put
> into the list. That way the memory is freed first rather than the object
> lingering until it is overwritten. Not sure if any of this will help
> exactly, but I hope it does.
>
> from time import time
> from sets import Set
> from string import split
> file_name = 'pathtable_ht.dat' #change the name to avoid confusion
> result = []
> start_time = time ()
> #Note there is no open(file_name,r) needed here.

What if the data file is gzipped?
(I'm now using gzip to keep the data file small; the added time is
negligible compared to the time needed for the actual processing.)
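
If it helps, here is a minimal sketch of how the same line-by-line loop
could read the gzipped file directly (assuming the compressed file is
called 'pathtable_ht.dat.gz'; I use readline() to stay on the safe side
across Python versions, and the loop body would be the same as in the
snippet below):

import gzip

f = gzip.open('pathtable_ht.dat.gz')   # hypothetical name of the compressed file
line = f.readline()                    # GzipFile offers readline(), like a plain file
while line:
    # ... same per-line processing as in the snippet below ...
    line = f.readline()
f.close()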

> for line in file(file_name):
>         splitres = line.split()
>         tuple_size = int(splitres[0])+1
>         path_tuple = tuple(splitres[1:tuple_size])
>         conflicts = Set(map(int,splitres[tuple_size:-1]))
>         # do something with 'path_tuple' and 'conflicts'
>         # ... do some processing ...
>         result.append(( path_tuple, conflicts))
>         del conflicts   # try to free the memory
>         del path_tuple  # by deleting these 2 names
>

I'm not a Python memory specialist, but does del immediately release/free
the memory back to the OS? I thought it was impossible to make Python
release memory to the OS right away.
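
(My understanding, which may well be wrong: del only removes a name
binding, and since each Set is also referenced from 'result', it cannot be
freed at that point anyway. A tiny sketch of what I mean:)

from sets import Set

result = []
conflicts = Set([1, 2, 3])
result.append(conflicts)   # 'result' now references the same Set object
del conflicts              # this only removes the *name* 'conflicts'
print result[0]            # Set([1, 2, 3]) -- the object itself is still alive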


By the way, does anyone know if there's a profile-like module to keep track
of memory usage per object?

Thanks in advance,
Stan.



