Parsing a potentially corrupted file

Wed Dec 14 14:39:40 EST 2016

On 2016-12-14 11:43, Paul Moore wrote:
> I'm looking for a reasonably "clean" way to parse a log file that potentially has incomplete records in it.
>
> The basic structure of the file is a set of multi-line records. Each record starts with a series of fields delimited by [...] (the first of which is always a date), optionally separated by whitespace. Then there's a trailing "free text" field, optionally followed by a multi-line field delimited by [[...]]
>
> So, example records might be
>
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here
>
> (a record delimited by the end of the line)
>
> or
>
> [2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here [[Additional
> data, potentially multiple lines
>
> including blank lines
> goes here
> ]]
>
> The terminating ]] is on a line of its own.
>
> This is a messy format to parse, but it's manageable. However, there's a catch. Because the logging software involved is broken, I can occasionally get a log record prematurely terminated with a new record starting mid-stream. So something like the following:
>
> [2016-11-30T20:04:08.000+00:00] [Component] [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of the issue goes here
>
> I'm struggling to find a "clean" way to parse this. I've managed a clumsy approach, by splitting the file contents on the pattern [dddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where this gets truncated) and then treating each entry as a record and parsing it individually. But the resulting code isn't exactly maintainable, and I'm looking for something cleaner.
>
> Does anyone have any suggestions for a good way to parse this data?
>
I think I'd do something like this:

class TruncatedError(Exception):
     """A new record started before the current one was complete."""

while have_more(input):
     # At the start of a record.
     timestamp = parse_timestamp(input)

     fields = []
     description = None
     additional = None

     try:
         for i in range(5):
             # A field shouldn't contain a '[', so if it sees one, it'll
             # push it back and return True for truncated.
             field, truncated = parse_field(input)
             fields.append(field)

             if truncated:
                 raise TruncatedError()

         # The description shouldn't contain a timestamp, but if it does,
         # it'll push it back from that point and return True for truncated.
         description, truncated = parse_description(input)

         if truncated:
             raise TruncatedError()

         # The additional information shouldn't contain a timestamp, but if
         # it does, it'll push it back from that point and return True for
         # truncated.
         additional, truncated = parse_additional_information(input)

         if truncated:
             raise TruncatedError()
     except TruncatedError:
         process_record(timestamp, fields, description, additional,
                        truncated=True)
     else:
         process_record(timestamp, fields, description, additional)
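
For what it's worth, here is a minimal runnable sketch along those lines. It is only one possible reading of the outline above: rather than pushing characters back onto the input, it uses the timestamp pattern to bound each record and then parses the fields, description and additional data inside that span. The names TIMESTAMP, FIELD, NUM_FIELDS, parse_body and parse_records are made up for illustration, the timestamp pattern is assumed from the example records, and a truncated field is simply dropped rather than kept as a partial value.

```python
import re

# Timestamp pattern assumed from the example records in the question.
TIMESTAMP = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})\]"
)
# One [...] field, optionally preceded by whitespace; no brackets inside.
FIELD = re.compile(r"\s*\[([^\[\]]*)\]")
NUM_FIELDS = 5  # fields after the timestamp, as in the example records


def parse_body(body, at_eof):
    """Parse the text between one timestamp and the next.

    Returns (fields, description, additional, truncated).
    """
    fields, description, additional = [], None, None
    pos = 0
    for _ in range(NUM_FIELDS):
        m = FIELD.match(body, pos)
        if m is None:
            # The record ended in the middle of (or before) a field.
            return fields, description, additional, True
        fields.append(m.group(1))
        pos = m.end()

    rest = body[pos:]
    head, sep, tail = rest.partition("[[")
    description = head.strip()
    if sep:
        # Multi-line field: must be closed by "]]" on a line of its own.
        m = re.search(r"^\]\][ \t]*$", tail, re.MULTILINE)
        if m is None:
            return fields, description, tail, True
        return fields, description, tail[:m.start()], False

    # Single-line record: complete if the line ended, or the input did.
    truncated = not (rest.endswith("\n") or at_eof)
    return fields, description, additional, truncated


def parse_records(text):
    """Yield (timestamp, fields, description, additional, truncated)."""
    matches = list(TIMESTAMP.finditer(text))
    for i, m in enumerate(matches):
        last = i + 1 == len(matches)
        # Anything from the next timestamp onwards belongs to the next record.
        end = len(text) if last else matches[i + 1].start()
        yield (m.group(1), *parse_body(text[m.end():end], at_eof=last))
```

Calling parse_records() on the whole file contents yields one tuple per record, with truncated set to True for any record that was cut short by the start of the next one. It relies on the same assumption as the original question, namely that the timestamp itself never gets truncated.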