[Tutor] Extracting data between strings

Steven D'Aprano steve at pearwood.info
Wed May 27 16:38:43 CEST 2015


Hi Richard,

I'm not sure how advanced you are, whether you have any experience or if 
you're a total beginner. If anything I say below doesn't make sense, 
please ask! Keep your replies on the list, and somebody will be happy to 
answer.


On Wed, May 27, 2015 at 09:26:15AM -0400, richard kappler wrote:

> I'm writing a script that reads from an in-service log file in xml format
> that can grow to a couple gigs in 24 hours, then gets zipped out and
> restarts at zero.

Please be more specific -- does the log file already get zipped up each 
day, and you have to read from it, or is your script responsible for 
reading AND zipping it up?

The best solution is to use your operating system's logrotate utility 
to zip up and rotate the logs. On Linux, Mac OS X and many Unix 
systems, logrotate is a standard utility. You should use it: it is 
reliable, configurable, heavily tested and debugged, and better than 
anything you will write.

On Windows, I don't know if there is any equivalent to logrotate. 
Probably not -- Windows' standard utilities are painfully impoverished 
and underpowered compared to what Linux and many Unixes provide as 
standard.

If you must write your own logrotate, just write a script that zips up 
and renames the log file. At *most*, check whether the file is "big 
enough". In pseudo-code:


# logrotate.py
if the file is more than X bytes in size:
    rename the log to log.temp
    create a new log file for the process to write to
    zip up log.temp as log.date.zip
check for any old log.zip files that need deleting
(e.g. "keep maximum of five log files, deleting the oldest")


Then use your operating system's scheduler to schedule this script to 
run every day at (say) 3am. Don't schedule anything between 1am and 
3am! If you do, then when daylight savings changes, it may run twice, 
or not run at all. Again, Linux and Unix have a standard scheduler, 
cron; on Windows there is the built-in Task Scheduler.
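
For example, a crontab entry to run the script daily at 3am might look 
like this (the script path is made up, of course):

0 3 * * * /usr/bin/python /path/to/logrotate.py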

The point being, don't re-invent the wheel! If your system already has a 
log rotator, use that; if it has a scheduler, use that; only if it lacks 
both should you write your own.


To read the config settings (e.g. how big is X bytes?), use the 
configparser module. To do the zipping, use the zipfile module. shutil 
and os modules will also be useful. Do you need more pointers or is that 
enough to get you started?
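
For instance, here is a sketch of reading the threshold from an INI 
file; the file name, section and option names below are all made up 
for illustration:

# logrotate.ini contains:
#     [rotate]
#     max_bytes = 104857600
#     keep = 5

try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2

config = configparser.ConfigParser()
config.read('logrotate.ini')
max_bytes = config.getint('rotate', 'max_bytes')
keep = config.getint('rotate', 'keep')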

Now, on to the script that extracts data from the logs... 

> My script must check to see if new entries have been
> made, find specific lines based on 2 different start tags, and from those
> lines extract data between the start and end tags (hopefully including the
> tags) and write it to a file. I've got the script to read the file, see if
> it's grown, find the appropriate lines and write them to a file. 

It's probably better to use a directory listener (a log watcher that 
tails the file) than to keep re-scanning the file over and over again.

Perhaps you can adapt this?

http://code.activestate.com/recipes/577968-log-watcher-tail-f-log/
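
If that recipe is more than you need, here is a bare-bones polling 
loop using nothing but the standard library. It remembers its offset 
between passes, and treats a shrinking file as a sign that the log was 
rotated, which should also take care of your nightly reset. The 
two-second sleep and the process_line function are placeholders of 
mine:

import os
import time

LOG = 'log.txt'

def process_line(line):
    # placeholder: your per-line extraction goes here
    print(line.rstrip())

position = 0
while True:
    # if the file shrank, it was rotated: start again from the top
    if os.path.getsize(LOG) < position:
        position = 0
    with open(LOG, 'r') as f:
        f.seek(position)
        for line in f:
            process_line(line)
        position = f.tell()
    time.sleep(2)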


More comments below:


> I  still
> need to strip out just the data I need (between the open and close tags)
> instead of writing the entire line, and also to reset eof when the nightly
> zip / new log file creation occurs. I could use some guidance on stripping
> out the data, at the moment I'm pretty lost, and I've got an idea about the
> nightly reset but any comments about that would be welcome as well. Oh, and
> the painful bit is that I can't use any modules that aren't included in the
> initial Python install. My code is appended below.
> 
> regards, Richard
> 
> 
> import time
> 
> while True:
>     #open the log file containing the data
>     file = open('log.txt', 'r')
>     #find inital End Of File offset
>     file.seek(0,2)
>     eof = file.tell()
>     #set the file size again
>     file.seek(0,2)
>     neweof = file.tell()
>     #if the file is larger...


Two comments here. First, as written, eof and neweof are read 
back-to-back from the same unchanged file, so "neweof > eof" can never 
be true; you need to remember the previous offset across passes, as 
the watcher sketch above does. Second, you can tell how big the file 
is without opening it:

import os
filesize = os.path.getsize('path/to/file')


>     if neweof > eof:
>         #go back to last position...
>         file.seek(eof)
> # open file to which the lines will be appended
>         f1 = open('newlog.txt', 'a')
> # read new lines in log.txt
>         for line in file.readlines():

For huge files, it is *MUCH* more efficient to write:

    for line in file: ... 

than to use file.readlines(): readlines() slurps the entire file into 
a list in memory, while iterating over the file reads one line at a 
time.


>             #check if line contains needed data
>             if "usertag1" in line or "SeMsg" in line:
> 
> ############################################################
> #### this should extract the data between usertag1 and  ####
> #### and /usertag1, and between SeMsg and /SeMsg,       ####
> #### writing just that data to the new file for         ####
> #### analysis. For now, write entire line to file       ####
> ############################################################

You say "between usertag1 ... AND between SeMsg..." -- what if only 
one pair of tags is present? I'm going to assume that either tag, or 
both, may appear on any given line.

for line in file:
    for tag in ("usertag1", "SeMsg"):
        if tag in line:
            start = line.find(tag)
            # look for the matching close tag after the open tag
            end = line.find("/" + tag, start)
            if end == -1:
                print("warning: no /%s found!" % tag)
            else:
                # slice up to and including the close tag
                data = line[start:end + len(tag) + 1]
                f1.write(data)
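
Alternatively, since you say the log is XML, and assuming the tags are 
real elements like <usertag1>...</usertag1> that open and close on the 
same line, the standard library's re module may be less fiddly than 
manual find() calls:

import re

# match <usertag1>...</usertag1> or <SeMsg>...</SeMsg>, tags included;
# the backreference \1 insists the close tag matches the open tag
pattern = re.compile(r"<(usertag1|SeMsg)>.*?</\1>")

for line in file:
    for match in pattern.finditer(line):
        f1.write(match.group(0) + "\n")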



Does this help?




-- 
Steve

