[Tutor] Logfile Manipulation

Wayne Werner waynejwerner at gmail.com
Mon Nov 9 16:15:37 CET 2009


On Mon, Nov 9, 2009 at 7:46 AM, Stephen Nelson-Smith <sanelson at gmail.com> wrote:

> And the problem I have with the below is that I've discovered that the
> input logfiles aren't strictly ordered - ie there is variance by a
> second or so in some of the entries.
>

Within a given set of 10 lines, are the first line and the last line "in
order"? For example:

1
2
4
3
5
8
7
6
9
10


> I can sort the biggest logfile (800M) using unix sort in about 1.5
> mins on my workstation.  That's not really fast enough, with
> potentially 12 other files....
>

If that's the case, then I'm pretty sure you can create a sort of queue
system, and it should probably cut down on the sorting time. Python's
list.sort() is Timsort, which does well on nearly sorted data, and since the
window never holds more than 10 entries, the work per line is bounded by a
constant (roughly O(10 log 10)) no matter how big the file is. Something
such as this:


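from itertools import islice

log_generator = iter(logdata)  # logdata: any iterable of log entries
mylist = list(islice(log_generator, 10))  # first ten values

while mylist:
    mylist.sort()
    nextdata = mylist.pop(0)  # smallest entry in the current window
    try:
        mylist.append(next(log_generator))
    except StopIteration:
        pass  # input exhausted; keep draining what is left in mylist

    # Do something with nextdata

print('done')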

Or now that I look, Python has a priority queue module, heapq (
http://docs.python.org/library/heapq.html ), that you could use instead. Seed
it with some initial quantity of entries, 10 or so, then push the next value
in and pop one out: the pop always gives you the smallest value currently in
the queue.
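
A rough sketch of the heapq version, in case it helps. The names
sorted_window and window are made up for illustration, and it assumes that
entries compare in timestamp order (e.g. lines that start with a sortable
timestamp) and that no entry is more than `window` positions away from where
it belongs. If entries can be further out of place than that, the output will
not be fully sorted.

```python
import heapq
from itertools import islice


def sorted_window(entries, window=10):
    """Yield entries in ascending order using a small heap as a buffer.

    Hypothetical helper: assumes each entry is at most `window`
    positions away from its correctly sorted position.
    """
    it = iter(entries)
    heap = list(islice(it, window))  # seed the heap with the first entries
    heapq.heapify(heap)
    for item in it:
        # Push the new entry, then pop and yield the smallest one held.
        yield heapq.heappushpop(heap, item)
    # Input exhausted: drain whatever is left, smallest first.
    while heap:
        yield heapq.heappop(heap)


# The sample sequence from earlier in this message, with a window of 3.
sample = [1, 2, 4, 3, 5, 8, 7, 6, 9, 10]
result = list(sorted_window(sample, window=3))
```

Each push/pop costs O(log window), so the whole pass is a single read through
the file with a small, fixed amount of memory.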

HTH,
Wayne