[Tutor] processing multi entry logs

Luke Paireepinart rabidpoobear at gmail.com
Tue Aug 15 05:02:37 CEST 2006


Reed L. O'Brien wrote:
> I have a log file. Essentially the file has 2 important entries for
> each process id: one when the process starts, with the id and another
> piece of data, and a second when the process finishes, with the
> result and the process id. I need to get data from both to make
> a sensible representation of the data. The file can be very large, in
> excess of 400MB, and the process id entries can be any random distance
> apart.
>
Are you in control of the format of the log file?  Is it possible that,
in the future, you could instead log everything to an SQL table or
something of the sort, to make it easier to get at the data you want?
I understand that right now you have a log that you need to parse, but if
every time you need something from the log you have to scan 400MB of text,
it might take a little longer than you'd like.
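
As a rough sketch of what I mean (the database filename, table layout, and
sample values here are all made up, not anything from your app), logging
with the sqlite3 module could look something like this:

import sqlite3

# hypothetical schema: one row per start/end event, keyed by process id
conn = sqlite3.connect("process_log.db")
conn.execute("""CREATE TABLE IF NOT EXISTS events
                (process_id INTEGER, kind TEXT, data TEXT)""")

# the logger inserts a row when a process starts and another when it finishes
conn.execute("INSERT INTO events VALUES (?, ?, ?)", (1234, "start", "some data"))
conn.execute("INSERT INTO events VALUES (?, ?, ?)", (1234, "end", "the result"))
conn.commit()

# pairing up start and end data is then a query, not a 400MB scan
for row in conn.execute("SELECT kind, data FROM events WHERE process_id = ?",
                        (1234,)):
    print(row)

That way you pay the parsing cost once, when the entry is written, instead of
every time you want to read the data back.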

> I am hoping for input regarding the best way to do it.
>
> I can't think of an efficient way to store the data from the first entry. 
>
> Keep processing line by line and check against the partially recorded ids?
>
>
> Maintain separate lists and merge them at the end?

You could do something like this (in semi-Python):
tasks = {}
for line in logfile:                   # iterate line by line; readlines() would pull all 400MB into memory
    process_id, data = parse(line)     # however you extract the id and payload from a line
    if is_start_entry(line):
        tasks[process_id] = [data]     # first sighting: remember the start data
    elif is_end_entry(line):
        tasks[process_id].append(data) # second sighting: tack on the result

This should work because you said the process_id is common to both the
start entry and the end entry.
The only problem is determining whether a given line is a start entry or
an end entry.
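
For example, if the lines happened to look like "1234 START some data" and
"1234 END the result" (a made-up format; substitute however your entries are
actually laid out), the loop above could be filled in as:

tasks = {}
logfile = open("process.log")
for line in logfile:
    fields = line.split()
    if len(fields) < 3:
        continue                            # skip blank or malformed lines
    process_id, kind = fields[0], fields[1]
    data = " ".join(fields[2:])
    if kind == "START":
        tasks[process_id] = [data]          # first entry for this id
    elif kind == "END":
        tasks[process_id].append(data)      # second entry completes the pair
logfile.close()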

Now you have a dictionary whose keys are the process ids and whose values
are two-element lists of [start data, end data].
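Pulling the pairs back out afterwards is just a walk over the dictionary:

for process_id, entries in tasks.items():
    if len(entries) == 2:
        start_data, end_data = entries
        print(process_id, start_data, end_data)
    else:
        print(process_id, "has no end entry")   # that process never finished, or the log is incomplete
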
Does that do what you need?
HTH,
-Luke

