[Tutor] formatting xml (again)

richard kappler richkappler at gmail.com
Tue Dec 27 16:05:26 EST 2016

The input is consistent in that it all has stx at the beginning of each
'event.' I'm leaning towards regex. When you say:

" find stx, stuff lines until I see the next stx, then dump and continue"

Might I trouble you for an example of how you do that? I can find stx, and I
can find etx, using something along the lines of:

a = [m.start() for m in re.finditer(r"<devicename>", line)]

but then I get a little lost, mostly because I have some lines that have
"data data [\x03][\x02] data" and then to the next line. More succinctly,
the stx aren't always at the beginning of the line, etx not always at the
end. No problem, I can find them, but then I'm guessing I would have to
write to a buffer starting with stx, keep writing to the buffer until I get
to etx, write the buffer to file (or send it over the socket, either way is
fine) then continue on. The fact that 'events' span multiple lines is
challenging me.
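
A minimal sketch of the buffer idea described above, assuming every event is
wrapped in one stx/etx pair and that anything outside a pair can be thrown
away. The names iter_events and sample are made up for illustration, and the
regex is just one way to pull complete events out of the buffer:

```python
import re

STX = "\x02"
ETX = "\x03"

# Non-greedy match from one STX to the next ETX. DOTALL lets "." match
# newlines, so an event can span several lines.
EVENT_RE = re.compile(STX + r"(.*?)" + ETX, re.DOTALL)


def iter_events(lines):
    """Yield the text between each STX/ETX pair, across line boundaries."""
    buffer = ""
    for line in lines:
        buffer += line
        last_end = 0
        # Emit every complete event currently sitting in the buffer.
        for match in EVENT_RE.finditer(buffer):
            yield match.group(1)
            last_end = match.end()
        # Keep only what follows the last complete event.
        buffer = buffer[last_end:]
        # Assumption: data before a pending STX is not part of any event.
        start = buffer.find(STX)
        buffer = buffer[start:] if start != -1 else ""


# Hypothetical input: two events share a line, and events span lines.
sample = [
    "\x02<a>one</a>\x03\x02<b>tw\n",
    "o</b>\x03 junk \x02<c>\n",
    "three</c>\x03\n",
]
events = list(iter_events(sample))
for event in events:
    # Assumption: embedded newlines are dropped to get one event per line.
    print(event.replace("\n", ""))
```

Since an open file is iterable line by line, it could be passed in place of
sample, with each event written to the output file (or the socket) instead of
printed. Rescanning the buffer on every line is simple but could get slow if
a single event is very large.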

On Tue, Dec 27, 2016 at 3:55 PM, David Rock <david at graniteweb.com> wrote:

> * richard kappler <richkappler at gmail.com> [2016-12-27 15:39]:
> > I was actually working somewhat in that direction while I waited. I had
> > in mind to use something along the lines of:
> >
> >
> > stx = '\x02'
> > etx = '\x03'
> > line1 = ""
> >
> > with open('original.log', 'r') as f1:
> >    with open('new.log', 'w') as f2:
> >         for line in f1:
> >             if stx in line:
> >                 line1 = line1 + line
> >             if not stx in line:
> >                 if not etx in line:
> >                     line1 = line1 + line
> >             if etx in line:
> >                 line1 = line1 + line + '\n'
> >                 f2.write(line1)
> >                 line1 = ""
> >
> >
> > but that didn't work. It neither broke each line on etx (multiple events
> > with stx and etx on one line) nor did it concatenate the multi-line
> > events.
>
> A big part of the challenge sounds like it's inconsistent data
> formatting.  You are going to have to identify some way to reliably
> check for the beginning/end of your data for it to work.  Do you know if
> you will always have \x02 at the start of a section of input, for example?
>
> The way I usually do log parsing in that case is use the stx as a flag
> to start doing other things (i.e., if I find stx, stuff lines until I see
> the next stx, then dump and continue).  If you have intermediary data
> that is not between your stx and etx (comment lines, other data that you
> don't want), then it gets a lot harder.
>
> If you don't have at least a marginally consistent input, your only real
> option is probably going to be scanning by character and looking for the
> \x02 and \x03 to get a glob of data, then parse that glob with some kind
> of xml parser, since the data between those two is likely safe-ish.
>
> --
> David Rock
> david at graniteweb.com
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
