[Tutor] Iterable Understanding
Stephen Nelson-Smith
sanelson at gmail.com
Mon Nov 16 01:11:07 CET 2009
Hi Marty,
Thanks for a very lucid reply!
> Well, you haven't described the unreliable behavior of unix sort so I
> can only guess, but I assume you know about the --month-sort (-M) flag?
Nope - but I can look it up. The problem I have is that the source
logs are rotated at 0400 hrs, so I need two days of logs in order to
extract 24 hrs from 0000 to 2359 (which is the requirement). At
present, I preprocess using sort, which works fine as long as the
month doesn't change.
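For what it's worth, a tiny illustration of the failure mode (the lines are made up): plain string ordering puts 'Dec' before 'Nov', whereas keying on the parsed timestamp gets the chronology right.

import time

# Two made-up syslog-style lines straddling a month boundary.
lines = [
    "Dec 01 00:00:05 host sshd[123]: session opened",
    "Nov 30 23:59:58 host sshd[122]: session closed",
]

print(sorted(lines))             # wrong: "Dec ..." sorts before "Nov ..."

def stamp(line):
    # first 15 characters are the syslog timestamp
    return time.strptime(line[:15], '%b %d %H:%M:%S')

print(sorted(lines, key=stamp))  # right: chronological order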
> import gzip
> import time
> from heapq import heappush, heappop, merge
Is this a preferred method, rather than just 'import heapq'?
> def timestamp(line):
>     # replace with your own timestamp function
>     # this appears to work with the sample logs I chose
>     stamp = ' '.join(line.split(' ', 3)[:-1])
>     return time.strptime(stamp, '%b %d %H:%M:%S')
I have some logfile entries with multiple IP addresses, so I can't
rely on splitting on whitespace like that.
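I'll probably anchor the timestamp with a regex instead, so whatever comes
later in the line doesn't matter. Just a sketch - the pattern is a guess at
my format:

import re
import time

# Assumes a syslog-style "Nov 16 01:11:07" prefix at the start of each
# line; the pattern would need adjusting to the real log format.
STAMP_RE = re.compile(r'^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})')

def timestamp(line):
    match = STAMP_RE.match(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return time.strptime(match.group(1), '%b %d %H:%M:%S')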
> class LogFile(object):
>     def __init__(self, filename, jitter=10):
>         self.logfile = gzip.open(filename, 'r')
>         self.heap = []
>         self.jitter = jitter
>
>     def __iter__(self):
>         while True:
>             for logline in self.logfile:
>                 heappush(self.heap, (timestamp(logline), logline))
>                 if len(self.heap) >= self.jitter:
>                     break
Really nice way to handle the batching of the initial heap - thank you!
>             try:
>                 yield heappop(self.heap)
>             except IndexError:
>                 raise StopIteration
>
> logs = [
>     LogFile("/home/stephen/qa/ded1353/quick_log.gz"),
>     LogFile("/home/stephen/qa/ded1408/quick_log.gz"),
>     LogFile("/home/stephen/qa/ded1409/quick_log.gz")
> ]
>
> merged_log = merge(*logs)
> with open('/tmp/merged_log', 'w') as output:
>     for stamp, line in merged_log:
>         output.write(line)
Oooh, I've never used 'with' before. In fact, I'm currently
restricted to 2.4 on the machine on which this will run. That wasn't a
problem for heapq.merge, as I was just able to copy the code from the
2.6 source. Or I could use Kent's recipe.
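If I do stay on 2.4, I suppose try/finally gives me the same guarantee as
the with-block:

# Python 2.4-friendly equivalent of the with-block: the output file is
# closed even if the write loop raises.
output = open('/tmp/merged_log', 'w')
try:
    for stamp, line in merged_log:
        output.write(line)
finally:
    output.close()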
> ... which probably won't preserve the order of log entries that have the
> same timestamp, but if you need it to -- should be easy to accommodate.
I don't think that is necessary, but I'm curious to know how...
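My guess is that you'd push a running sequence number in between the
timestamp and the line, so entries with equal timestamps fall back to
arrival order instead of comparing the line text. A toy illustration of
the idea:

import heapq

heap = []
seq = 0   # running arrival counter, bumped for every line pushed

# Pretend both lines carry the same timestamp.
same_stamp = (2009, 11, 16, 1, 11, 7)
for line in ['zebra entry\n', 'aardvark entry\n']:
    heapq.heappush(heap, (same_stamp, seq, line))
    seq += 1

# Pops come back in arrival order ('zebra' first), not alphabetical order.
while heap:
    stamp, order, line = heapq.heappop(heap)
    print(line)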
Now... this is brilliant. What it doesn't do that mine does is
handle dates: mine checks whether each line starts with the appropriate
date, so we can extract 24 hrs of data. I'll need to try to include
that. Also, I need to do some filtering and gsubbing, but I think I'm
firmly on the right path now, thanks to you.
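For the date part, I'm thinking of filtering the merged stream before
writing, along these lines (the target month and day are just placeholders,
and it leans on the struct_time that timestamp() returns):

import sys

TARGET_MONTH, TARGET_DAY = 11, 16   # placeholder values for the wanted day

for stamp, line in merged_log:
    # keep only entries from the target calendar day
    if stamp.tm_mon != TARGET_MONTH or stamp.tm_mday != TARGET_DAY:
        continue
    # any extra filtering or substitution would slot in here, e.g.
    # line = line.replace('10.1.2.3', 'x.x.x.x')
    sys.stdout.write(line)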
> HTH,
Very much indeed.
S.