[Tutor] Extracting data between strings

Peter Otten __peter__ at web.de
Wed May 27 18:43:52 CEST 2015


richard kappler wrote:

> I'm writing a script that reads from an in-service log file in xml format
> that can grow to a couple gigs in 24 hours, then gets zipped out and
> restarts at zero. My script must check to see if new entries have been
> made, find specific lines based on 2 different start tags, and from those
> lines extract data between the start and end tags (hopefully including the
> tags) and write it to a file. I've got the script to read the file, see if
> it's grown, find the appropriate lines and write them to a file. I still
> need to strip out just the data I need (between the open and close tags)
> instead of writing the entire line, and also to reset eof when the nightly
> zip / new log file creation occurs. I could use some guidance on stripping
> out the data, at the moment I'm pretty lost, and I've got an idea about
> the nightly reset but any comments about that would be welcome as well.
> Oh, and the painful bit is that I can't use any modules that aren't
> included in the initial Python install. My code is appended below.

You'll probably end up with something closer to your original idea and
Steven's suggestions, but let me just point out that lines and XML live in
two distinct worlds: an XML element may span several lines or share a line
with other elements. Here's my (sketchy) attempt to treat XML as XML rather
than as lines in a text file. (I got some hints here:
http://effbot.org/zone/element-iterparse.htm).

import os
import sys
import time

from xml.etree.ElementTree import iterparse, tostring

class Rollover(Exception):
    pass

class File:
    def __init__(self, filename, sleepinterval=1):
        self.size = 0
        self.sleepinterval = sleepinterval
        self.filename = filename
        self.file = open(filename, "rb")

    def __enter__(self):
        return self

    def __exit__(self, etype, evalue, traceback):
        self.file.close()

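    def read(self, size=-1):
        # Block until data is available; iterparse() calls this repeatedly.
        # (A default of size=0 would always read b"" and loop forever.)
        while True:
            s = self.file.read(size)
            if s:
                return s
            # Nothing new yet: wait, then check whether the log was restarted.
            time.sleep(self.sleepinterval)
            self.check_rollover()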

    def check_rollover(self):
        newsize = os.path.getsize(self.filename)
        if newsize < self.size:
            raise Rollover()
        self.size = newsize

WANTED_TAGS = {"usertag1", "SeMsg"}

while True:
    try:
        with File("log.txt") as f:
            context = iterparse(f, events=("start", "end"))
            event, root = next(context)
            wanted_count = 0

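            for event, element in context:
                if event == "start":
                    # Only wanted tags change the nesting counter; other
                    # start events are ignored.
                    if element.tag in WANTED_TAGS:
                        wanted_count += 1
                else:
                    assert event == "end"
                    if element.tag in WANTED_TAGS:
                        wanted_count -= 1
                        print("LOGGING")
                        print(tostring(element))
                # Outside any wanted element: drop parsed data to bound memory.
                if wanted_count == 0:
                    root.clear()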
    except Rollover:
        print("doing a rollover", file=sys.stderr)
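
To experiment without a live log, here is a minimal sketch of the same
iterparse() idea run against an in-memory sample. The sample document, the
extract() helper and the io.BytesIO stand-in for the log file are assumptions
made up for illustration; only the two tag names come from the script above.

```python
import io
from xml.etree.ElementTree import iterparse, tostring

WANTED_TAGS = {"usertag1", "SeMsg"}

# Hypothetical sample data standing in for the real log file.
SAMPLE = b"""<log>
<entry><usertag1>alpha</usertag1></entry>
<entry><other>skip me</other></entry>
<entry><SeMsg id="7">beta</SeMsg></entry>
</log>"""

def extract(stream, wanted=WANTED_TAGS):
    """Yield each wanted element, tags included, serialized as bytes."""
    context = iterparse(stream, events=("start", "end"))
    event, root = next(context)  # the first event is the start of the root
    depth = 0  # how many wanted elements are currently open
    for event, element in context:
        if event == "start":
            if element.tag in wanted:
                depth += 1
        elif element.tag in wanted:  # an "end" event for a wanted element
            depth -= 1
            yield tostring(element).strip()
        if depth == 0:
            root.clear()  # forget everything parsed so far

found = list(extract(io.BytesIO(SAMPLE)))
for chunk in found:
    print(chunk.decode())
```

With this sample the loop prints the usertag1 and SeMsg elements, start and
end tags included, and skips the "other" element.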



