Parsing a potentially corrupted file
p.f.moore at gmail.com
Wed Dec 14 09:07:27 EST 2016
On Wednesday, 14 December 2016 12:57:23 UTC, Chris Angelico wrote:
> Is the "[Component]" section something you could verify? (That is - is
> there a known list of components?) If so, I would include that as a
> secondary check. Ditto anything else you can check (I'm guessing the
> [level] is one of a small set of values too.)
Possibly, although this is to analyze the structure of a basically undocumented log format. So if I validate too tightly, I end up just checking my assumptions rather than checking the data :-(
> The logic would be
> something like this:
> Read line from file.
> Verify line as a potential record:
>     Assert that line begins with timestamp.
>     Verify as many fields as possible (component, level, etc)
> Search line for additional timestamp.
> If additional timestamp found:
>     Recurse. If verification fails, assume we didn't really have a
>     corrupted line.
>     (Process partial line? Or discard?)
> If "[[" in line:
>     Until line is "]]":
>         Read line from file, append to description
>         If timestamp found:
>             Recurse. If verification succeeds, break out of loop.
> Unfortunately it's still not really clean; but that's the nature of
> working with messy data. Coping with ambiguity is *hard*.
Yeah, that's essentially what I have now. As I say, it's working but nobody could really love it. But you're right, it's more the fault of the data than of the code.
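For what it's worth, the verification step can be sketched roughly like this. The timestamp shape here ([YYYY-MM-DD HH:MM:SS]) is only a guess, since the real format is undocumented:

```python
import re

# Assumed record shape: "[2016-12-14 09:07:27] [Component] [LEVEL] message"
# The timestamp pattern is a guess; the real log format is undocumented.
TIMESTAMP_RE = re.compile(r"\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]")

def looks_like_record_start(line):
    """Return True if the line begins with something timestamp-shaped."""
    return TIMESTAMP_RE.match(line) is not None

def find_embedded_timestamp(line):
    """Return the index of a second timestamp inside the line, or -1.

    A hit suggests two records were glued together by corruption.
    """
    m = TIMESTAMP_RE.search(line, 1)  # skip position 0, the line's own stamp
    return m.start() if m else -1
```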
One thought I had, which I might try, is to make the timestamp the one assumption I rely on, and read the file in as, in effect, a text stream, spitting out a record every time I see something matching the [timestamp] pattern. Then parse record by record. Truncated records should either be obvious (because the delimited fields have start and end markers, so unmatched markers = truncated record) or acceptable (because undelimited fields are free text). I'm OK with ignoring the possibility that the free text contains something that looks like a timestamp.
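For a file that fits in memory, the splitting step is a one-liner with a zero-width lookahead (Python 3.7+, where re.split accepts empty matches); the timestamp pattern is again an assumption:

```python
import re

# Assumed timestamp shape; the real pattern would come from the log.
TIMESTAMP = r"\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]"

def split_records(text):
    """Split a blob of log text into records, one per timestamp.

    The lookahead keeps each timestamp attached to its own record
    rather than consuming it as a delimiter.
    """
    parts = re.split(r"(?=" + TIMESTAMP + r")", text)
    # re.split yields an empty leading chunk when text starts with a
    # timestamp; drop empties.
    return [p for p in parts if p]
```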
The only problem with this approach is that I have more data than I'd really like to read into memory all at once, so I'd need to do some sort of streamed match/split processing. But thinking about it, that sounds like the sort of job a series of chained generators could manage. Maybe I'll look at that approach...
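The streaming version might look something like the generator below, which buffers lines until the next timestamp-prefixed line appears, so only one record is ever held in memory (same assumed timestamp shape as above):

```python
import re

# Assumed timestamp shape; the real pattern would come from the log.
TIMESTAMP_RE = re.compile(r"\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]")

def records(lines):
    """Yield one record at a time from an iterable of lines.

    A record is a timestamp-prefixed line plus any continuation lines
    up to (but not including) the next timestamp-prefixed line.
    """
    buffer = []
    for line in lines:
        if TIMESTAMP_RE.match(line) and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)
```

Because it takes any iterable of lines, it chains naturally: feed it an open file object, then feed its output into a per-record parser.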