Find Paths in log text - How to?
citizenkahn at gmail.com
Thu Mar 23 20:50:58 CET 2006
I am trying to parse a build log for errors. I figure I can do this
one of three ways:
- find the absolute platonic form of an error and search for that item
- create definitions of what patterns describe errors for each tool
which is used (ant, MSDEV, etc).
- rework the build such that all the return codes for all 3rd party are
captured and logged using my own error description
The return code method has a high level of effort attached and spreads
the responsibility for the task quite widely unless I can write a
little command pattern like wrapper script (which is a possibility).
Still it would mean all of the 100s of calls to tools would have to
wrapped and if any place a developer changed this, the build would leak
The relativism/definition based approach means that I must be sure to
capture all error cases which may prove difficult and false negatives
are a really dangerous problem.
In the Platonic/absolutist camp, I could define an error as an instance
of a word or phrase from the "bad list" that is not in a filename or
Bad List: [error, fatal, killed, not found].
Were I to go this way, I'd be faced with a major problem: in a world
where symbols and whitespace can be included in a path how can I
extract a path from a line of text?
Ugly Valid Paths:
C:\Program Files\A File Named Error .txt
/usr/#a file named error #.txt
This means that determining the boundaries of a path is non trivial.
Many build tools list filenames without their full path. All of my
files are <text>.<ext>, so that is a pattern that I might be able to
on windows all of my paths will start with [A-Z]:\ or \\
on unix the will tend to start with ./ or /.
Finding the starting point is not too difficult, but its that ending
I could generate a substring for each of the starting types and then
look at what came before.
for sep in [letterStart, uncStart, unixrootedStart, unixpwdStart]:
# create sub string
prePath = line.split(elem)
I could then split the postPath segment on the os.sep and the check for
unlikely cases in the list elements
- double spaces within a path element
- symbol characters within an element (although this is a little dicey)
Since I am parsing the log on the system on which it was generated, for
each path I could do an os.path.exists on the potential path.
If someone happens to know of a good method of extracting weird paths
out of logs, I'd be interested in hearing about it.
More information about the Python-list