[Tutor] Iterable Understanding
Stephen Nelson-Smith
sanelson at gmail.com
Mon Nov 16 01:11:07 CET 2009
Hi Marty,
Thanks for a very lucid reply!
> Well, you haven't described the unreliable behavior of unix sort so I
> can only guess, but I assume you know about the --month-sort (-M) flag?
Nope - but I can look it up. The problem I have is that the source
logs are rotated at 0400 hrs, so I need two days of logs in order to
extract 24 hrs from 0000 to 2359 (which is the requirement). At
present, I preprocess using sort, which works fine as long as the
month doesn't change.
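For what it's worth, a tiny illustration of the failure mode (the lines are made up): plain string ordering puts 'Dec' before 'Nov', whereas keying on the parsed timestamp gets the chronology right.

import time

# Two made-up syslog-style lines straddling a month boundary.
lines = [
    "Dec 01 00:00:05 host sshd[123]: session opened",
    "Nov 30 23:59:58 host sshd[122]: session closed",
]

print(sorted(lines))             # wrong: "Dec ..." sorts before "Nov ..."

def stamp(line):
    # first 15 characters are the syslog timestamp
    return time.strptime(line[:15], '%b %d %H:%M:%S')

print(sorted(lines, key=stamp))  # right: chronological order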
> import gzip
> import time
> from heapq import heappush, heappop, merge
Is this a preferred method, rather than just 'import heapq'?
> def timestamp(line):
>     # replace with your own timestamp function
>     # this appears to work with the sample logs I chose
>     stamp = ' '.join(line.split(' ', 3)[:-1])
>     return time.strptime(stamp, '%b %d %H:%M:%S')
I have some logfile entries with multiple IP addresses, so I can't
rely on splitting on whitespace like that.
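I'll probably anchor the timestamp with a regex instead, so whatever comes
later in the line doesn't matter. Just a sketch - the pattern is a guess at
my format:

import re
import time

# Assumes a syslog-style "Nov 16 01:11:07" prefix at the start of each
# line; the pattern would need adjusting to the real log format.
STAMP_RE = re.compile(r'^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})')

def timestamp(line):
    match = STAMP_RE.match(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return time.strptime(match.group(1), '%b %d %H:%M:%S')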
> class LogFile(object):
>     def __init__(self, filename, jitter=10):
>         self.logfile = gzip.open(filename, 'r')
>         self.heap = []
>         self.jitter = jitter
>
>     def __iter__(self):
>         while True:
>             for logline in self.logfile:
>                 heappush(self.heap, (timestamp(logline), logline))
>                 if len(self.heap) >= self.jitter:
>                     break
Really nice way to handle the batching of the initial heap - thank you!
>             try:
>                 yield heappop(self.heap)
>             except IndexError:
>                 raise StopIteration
>
> logs = [
>     LogFile("/home/stephen/qa/ded1353/quick_log.gz"),
>     LogFile("/home/stephen/qa/ded1408/quick_log.gz"),
>     LogFile("/home/stephen/qa/ded1409/quick_log.gz")
> ]
>
> merged_log = merge(*logs)
> with open('/tmp/merged_log', 'w') as output:
>     for stamp, line in merged_log:
>         output.write(line)
Oooh, I've never used 'with' before. In fact, I'm currently
restricted to 2.4 on the machine on which this will run. That wasn't a
problem for heapq.merge, as I was just able to copy the code from the
2.6 source. Or I could use Kent's recipe.
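If I do stay on 2.4, I suppose try/finally gives me the same guarantee as
the with-block:

# Python 2.4-friendly equivalent of the with-block: the output file is
# closed even if the write loop raises.
output = open('/tmp/merged_log', 'w')
try:
    for stamp, line in merged_log:
        output.write(line)
finally:
    output.close()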
> ... which probably won't preserve the order of log entries that have the
> same timestamp, but if you need it to -- should be easy to accommodate.
I don't think that is necessary, but I'm curious to know how...
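My guess is that you'd push a running sequence number in between the
timestamp and the line, so entries with equal timestamps fall back to
arrival order instead of comparing the line text. A toy illustration of
the idea:

import heapq

heap = []
seq = 0   # running arrival counter, bumped for every line pushed

# Pretend both lines carry the same timestamp.
same_stamp = (2009, 11, 16, 1, 11, 7)
for line in ['zebra entry\n', 'aardvark entry\n']:
    heapq.heappush(heap, (same_stamp, seq, line))
    seq += 1

# Pops come back in arrival order ('zebra' first), not alphabetical order.
while heap:
    stamp, order, line = heapq.heappop(heap)
    print(line)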
Now... this is brilliant. What it doesn't do that mine does is
handle dates: mine checks whether each line starts with the appropriate
date, so we can extract 24 hrs of data. I'll need to try to include
that. Also, I need to do some filtering and gsubbing, but I think I'm
firmly on the right path now, thanks to you.
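For the date part, I'm thinking of filtering the merged stream before
writing, along these lines (the target month and day are just placeholders,
and it leans on the struct_time that timestamp() returns):

import sys

TARGET_MONTH, TARGET_DAY = 11, 16   # placeholder values for the wanted day

for stamp, line in merged_log:
    # keep only entries from the target calendar day
    if stamp.tm_mon != TARGET_MONTH or stamp.tm_mday != TARGET_DAY:
        continue
    # any extra filtering or substitution would slot in here, e.g.
    # line = line.replace('10.1.2.3', 'x.x.x.x')
    sys.stdout.write(line)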
> HTH,
Very much indeed.
S.