[Tutor] need help parsing multiple log files to create a timeline. what am I doing wrong??

David L Neil PyTutor at DancesWithMice.info
Tue Feb 18 17:51:05 EST 2020


On 19/02/20 9:42 AM, Michael Cole wrote:
> I am working on parsing a bunch of log files to construct a timeline to
> represent changes on a network so that I can view events on different
> machines in parallel with matching timestamps. However, my program is
> getting bogged down and doesn't seem to be giving the right output.
> 
> expected behavior:
> 
> each row contains the name of the file it came from, and each cell in the
> row either is blank or contains a log with a timestamp in the same column
> as the matching timestamp from the timestamp row.
> 
> observed behavior:
> 
> readlines() reads the lines correctly. The timelines and headers are all
> built correctly.
> 
> the data in each row is not populated fully, with many entries still left
> blank. some entries are filled, but only a tiny fraction.
> 
> I am seeing that python is taking 99% of the CPU and that a lot of data is
> missing from the csv file generated.
...
code
...
> Example log format:
> 
> ---------- 2020-02-13 18:06:45 -0600: Logging Started ----------
> 02-13 18:18:24.370: 00:12:42: INFO [XMOS] Media clock unlocked!
> reason: unlocked on zeroing
> 02-13 18:18:24.421: XMOS clock update. state:0 source:ff, rate:ff
> 02-13 18:18:24.656: 00:12:43: INFO [XMOS] out of sequence error. this
> seq: 16 last seq: 41 timestamp: fceb397f
> 02-13 18:18:24.709: 00:12:43: INFO [XMOS] out of sequence error. this
> seq: 57 last seq: 80 timestamp: fd3a1012
> 02-13 18:18:31.830: XMOS clock update. state:1 source:ff, rate:ff
> 02-13 18:46:41.844: 00:41:00: INFO [XMOS] Media clock unlocked!
> reason: unlocked on zeroing
> 02-13 18:46:41.896: XMOS clock update. state:0 source:ff, rate:ff
> 02-13 18:46:42.131: 00:41:00: INFO [XMOS] out of sequence error. this
> seq: 86 last seq: 111 timestamp: 38052b81
> 02-13 18:46:42.183: 00:41:00: INFO [XMOS] out of sequence error. this
> seq: 126 last s
A couple of things, not directly answering the question:

- if using Linux, are you aware that there are facilities to centralise 
multiple machines' logs (e.g. rsyslog's remote-logging), which might 
ease your task?

- why spreadsheet? If your logs are all in txt format(?) and commence 
with a time-stamp, why not use them directly?

- are you aware of the many Python implementations and tools which 
'attack' log-related problems?


Would you please show some sample log-entries, in overlapping 
time-ranges (i.e. which should be interleaved)?

Do any of these log entries span multiple lines?
(lines terminated by your OpSys' line-separator(s) - not lines on a screen)


> each row contains the name of the file it came from, and each cell in the
> row either is blank or contains a log with a timestamp in the same column
> as the matching timestamp from the timestamp row.

- might one row contain more than one "log"?
- is "each row" in a different worksheet from "the timestamp row"? Why 
two kinds of "row"?
- why does the above description imply that the worksheet is not 
uniformly organised? eg (sketched below)
	firstCol = fileNM
	secondCol = timestamp
	thirdCol = log-entry
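
With that uniform layout, writing the CSV becomes trivial. A minimal 
sketch using the csv module - the file name, column headings, and 
sample triples are my inventions, and it assumes the entries have 
already been merged into timestamp order:

import csv

# hypothetical, already-merged (filename, timestamp, entry) triples
rows = [
    ("machine_a.log", "02-13 18:18:24.370",
     "INFO [XMOS] Media clock unlocked!"),
    ("machine_a.log", "02-13 18:18:24.421",
     "XMOS clock update. state:0 source:ff, rate:ff"),
]

with open("timeline.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "timestamp", "log entry"])  # header row
    writer.writerows(rows)           # one uniform row per log entry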


> the data in each row is not populated fully, with many entries still left
> blank. some entries are filled, but only a tiny fraction.

Please show examples.


Have you tried extracting some of these identified-faulty log-entries 
or worksheet rows into a testable sub-set, which you will more easily 
analyse (than if there are hundreds/thousands/tens-of-thousands of 
records to sift through)?


The multi-pass methodology may be unnecessarily complex - unless I've 
misunderstood some aspect.


The problem is a classic "merge" operation (once the staple of 
mainframe batch-computing). If one can assume that each of the 
input-files is sorted (into the same sequence), then the logic is to 
look at the 'current-record' from each, choose the 'earliest', output 
that, and replenish that file's current-record (rinse and repeat - plus 
watch out for EOF conditions!).
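
In Python the standard library already implements that logic: 
heapq.merge consumes several sorted iterables lazily and handles the 
EOF book-keeping for you. A sketch, assuming the file names and the 
fixed-width timestamp prefix (both my inventions - adjust to your real 
format):

import heapq

def tagged_lines(filename):
    # Yield (timestamp, filename, line) tuples; tuples compare
    # element-wise, so heapq.merge orders them by timestamp first.
    # Assumption: each line starts with an "MM-DD HH:MM:SS.mmm" stamp,
    # and multi-line entries have already been joined onto one line.
    with open(filename) as f:
        for line in f:
            yield (line[:18], filename, line.rstrip("\n"))

files = ["machine_a.log", "machine_b.log"]     # hypothetical names
for timestamp, filename, line in heapq.merge(
        *(tagged_lines(fn) for fn in files)):
    print(filename, timestamp, line)

Because heapq.merge never loads whole files into memory, it stays fast 
even with many large logs - which may also cure the 99%-CPU symptom if 
your current code re-scans the data once per cell.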


How do you like working with databases? Assuming (>=) basic skills, 
another option is to write the logs to an RDBMS (or to make step-1 the 
copying from 'standard' logs into one), and then use SQL to do all the 
'heavy-lifting' of the merge-and-sort, and reduce 'output' to a single 
retrieval-query!
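
For instance, with sqlite3 from the standard library (the database, 
table, and column names are mine, purely illustrative, and the 
timestamp slice assumes the same fixed-width prefix as above):

import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS logs (filename TEXT, ts TEXT, entry TEXT)")

# step 1: copy each 'standard' log into the table
for fn in ["machine_a.log", "machine_b.log"]:   # hypothetical names
    with open(fn) as f:
        records = ((fn, line[:18], line.rstrip("\n")) for line in f)
        conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", records)
conn.commit()

# step 2: a single query does the merge-and-sort
for filename, ts, entry in conn.execute(
        "SELECT filename, ts, entry FROM logs ORDER BY ts, filename"):
    print(filename, ts, entry)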
-- 
Regards =dn

