[Tutor] formatting xml (again)

David Rock david at graniteweb.com
Tue Dec 27 15:55:36 EST 2016


* richard kappler <richkappler at gmail.com> [2016-12-27 15:39]:
> I was actually working somewhat in that direction while I waited. I had in
> mind to use something along the lines of:
> 
> 
> stx = '\x02'
> etx = '\x03'
> line1 = ""
> 
> with open('original.log', 'r') as f1:
>    with open('new.log', 'w') as f2:
>         for line in f1:
>             if stx in line:
>                 line1 = line1 + line
>             if not stx in line:
>                 if not etx in line:
>                     line1 = line1 + line
>             if etx in line:
>                 line1 = line1 + line + '\n'
>                 f2.write(line1)
>                 line1 = ""
> 
> 
> but that didn't work. It neither broke each line on etx (multiple events
> with stx and etx on one line) nor did it concatenate the multi-line events.

A big part of the challenge sounds like inconsistent data formatting.
You are going to have to identify some reliable way to detect the
beginning and end of each event for it to work.  Do you know if you
will always have \x02 at the start of a section of input, for example?

The way I usually do log parsing in that case is to use the stx as a
flag to start doing other things (i.e., if I find stx, I collect lines
until I see the next stx, then dump what I have and continue).  If you
have intermediary data that is not between your stx and etx (comment
lines, other data that you don't want), then it gets a lot harder.
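
A minimal sketch of that flag approach, assuming every event begins on a
line containing \x02 and that no line holds more than one event (which
may not be true of the log in question).  The name group_by_stx is made
up for illustration:

```python
STX = '\x02'

def group_by_stx(lines):
    """Yield one joined string per event; an event starts at a line with STX."""
    buffer = None  # stays None until the first STX is seen
    for line in lines:
        if STX in line:
            # A new event begins: dump whatever was collected so far.
            if buffer is not None:
                yield ''.join(buffer)
            buffer = []
        if buffer is not None:
            # Drop the newline so a multi-line event becomes one line.
            buffer.append(line.rstrip('\n'))
    # Dump the final event once the input is exhausted.
    if buffer is not None:
        yield ''.join(buffer)
```

Since a file object iterates by line, something like
"for event in group_by_stx(f1): f2.write(event + '\n')" would write one
event per line.  Lines that appear before the first \x02 are ignored.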

If you don't have at least a marginally consistent input, your only real
option is probably going to be scanning by character and looking for the
\x02 and \x03 to get a glob of data, then parse that glob with some kind
of xml parser, since the data between those two is likely safe-ish.
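
A rough sketch of that fallback, assuming the whole log fits in memory
and that each event is wrapped in exactly one \x02 ... \x03 pair.  It
uses a regular expression in place of a hand-written character loop, and
the standard library's xml.etree.ElementTree as the parser; the function
names are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match of everything between STX and ETX; DOTALL lets '.'
# cross newlines so multi-line events are captured whole.
EVENT_RE = re.compile(r'\x02(.*?)\x03', re.DOTALL)

def extract_events(text):
    """Return the text between each STX/ETX pair, in order."""
    return [m.group(1) for m in EVENT_RE.finditer(text)]

def parse_events(text):
    """Yield (blob, element) pairs; element is None if the blob is not valid XML."""
    for blob in extract_events(text):
        try:
            yield blob, ET.fromstring(blob.strip())
        except ET.ParseError:
            # Keep the raw blob so bad events can be inspected later.
            yield blob, None
```

Reading the file with f.read() and passing the result to parse_events()
handles both several events on one line and one event spread over
several lines, because line boundaries are no longer significant.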

-- 
David Rock
david at graniteweb.com

