[Tutor] formatting xml (again)

richard kappler richkappler at gmail.com
Tue Dec 27 14:44:36 EST 2016


Using python 2.7 - I have a large log file we recorded of streamed xml data
that I now need to feed into another app for stress testing. The problem is
the data comes in 2 formats.

1. each 'event' is a full set of xml data with opening and closing tags +
x02 and x03 (stx and etx)

2. some events have all the xml data on one 'line' in the log, others are
in typical nested xml format with lots of white space and multiple 'lines'
in the log for each event, the first line of the 'event' starting with an
stx and the last line of the 'event' ending in an etx.

Examples of how the data looks from an editor (gedit specifically in this
case):
1[x02]<data><level1_data>some stuff<level2_data>some more
stuff></level2_data></level1_data></data>[x03]
2[x02]<data>
3     <level1_data>
4           some stuff
5           <level2_data>somestuff</level2_data>
6    </level1_data>
7</data>[x03]

I have tried to feed this raw into our other app (Splunk) and the app reads
each line (gedit numbered line) as an event. I want everything in between
each stx and etx to be one event.

I have tried:

#####################################
with open("original.log", 'r') as f1:
    with open("new.log", 'a') as f2:
        for line in f1:
            line2 = line.replace("\n", "")
            f2.write(line2)
######################################

Now this obviously doesn't work because, as stated above, each tag and
datum in the example above from lines 2 to 7 is on a different line. Python
is doing exactly what I tell it: stripping the \n from each line and then
writing the line out, with no regard for where each stx and etx fall, so I
don't end up with one event per line, which is what I want.

What I'm trying to do is collapse the 'expanded lines' between stx and etx
to one line, but I just can't wrap my head around how to do it. To put it
another way: how do I read each line from the original file, but write it
to another file so that everything from stx to etx, including the stx and
etx, is on one line in the file?
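
For illustration, here is a minimal sketch of one way this could be done. It
is not a tested solution for the real log. It assumes that every event ends
with an etx as the last character of a line (as in the examples above), that
no line holds more than one event, and that the indentation and line breaks
inside an event can be thrown away. The names collapse_events and
collapse_file are made up for this sketch.

```python
STX = "\x02"  # start-of-text marker that opens each event
ETX = "\x03"  # end-of-text marker that closes each event


def collapse_events(lines):
    """Yield one string per event, running from STX through ETX inclusive.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    parts = []
    for line in lines:
        # Assumption: leading/trailing blanks and the newline are not
        # significant, so they are dropped before joining.
        piece = line.strip(" \t\r\n")
        if not piece:
            continue  # skip blank lines
        parts.append(piece)
        if piece.endswith(ETX):
            # The event is complete: emit it as a single line.
            yield "".join(parts)
            parts = []
    if parts:
        # Input ended without a closing ETX; emit what was collected.
        yield "".join(parts)


def collapse_file(src_path, dst_path):
    """Write each event from src_path to dst_path on its own line."""
    with open(src_path, "r") as src:
        with open(dst_path, "w") as dst:
            for event in collapse_events(src):
                dst.write(event + "\n")
```

Calling collapse_file("original.log", "new.log") would then read the log one
line at a time, so the whole file never has to fit in memory, and write one
event per line. If the white space inside an event matters, the strip() call
would need to change to remove only the trailing newline.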

regards Richard

