[Tutor] Logfile multiplexing
davea at ieee.org
Wed Nov 11 14:43:39 CET 2009
Stephen Nelson-Smith wrote:
> On Wed, Nov 11, 2009 at 10:05 AM, Alan Gauld <alan.gauld at btinternet.com> wrote:
>> "Stephen Nelson-Smith" <sanelson at gmail.com> wrote
>>> I don't really want to admit defeat and have a cron job sort the logs
>>> before entry. Anyone got any other ideas?
>> Why would that be admitting defeat?
> Well, it means admitting defeat on solving the problem in Python. Yes
> in practical terms, I should probably preprocess the data, but as a
> programming exercise, learning how to sort a number of files into one
> is something I'd like to crack.
> Maybe the real lesson here is knowing which battles to fight, and a
> good developer uses the right tools for the job.
That's certainly a valuable lesson. On the other hand, I don't think
you've run out of useful Python options.
The problem as I see it is that your input files are *mostly* sorted,
and have an occasional out-of-order item. And further that you're not
sure how far out-of-order that can get; you're not willing to pick a
constant "10" and say that no item will have to be moved more than 10
items. Would you feel safe with 100? Or 1000?
If you give it to "a cron job", the code will be very generalized, and
perhaps not very efficient at sorting a list that is almost in order
already. So perhaps you can do better. You might want to time it to be
sure, or you might just decide that "good enough" doesn't need measuring.
Back to the fundamental problem. Apparently the max amount of jitter is
one second. So let's say you're processing all the 10:01 events, and
you encounter a 10:02 one. You don't know whether you've really got all
the 10:02 ones till you see your first 10:03. So you need a priority
queue combined with a couple of threshold values, in this case 10:01 and
10:03. Fill the queue till you hit the upper threshold, letting the
queue grow as needed. So you have all the 10:01 events. Now you feed
out your queue into the merge logic, and don't replace items as they are
removed, until you have removed the last one at your lower threshold.
At this point raise the thresholds by one, and fill it again as before.
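A minimal sketch of that generator, assuming each event is a
(timestamp, data) tuple with integer-second timestamps, and that no
event arrives more than `jitter` seconds out of order (the names
`merge_window` and `jitter` are mine, not anything standard):

```python
import heapq

def merge_window(events, jitter=1):
    """Yield events from a mostly-sorted iterable in sorted order.

    Assumes no event is timestamped more than `jitter` seconds
    earlier than any event that arrived before it.
    """
    heap = []
    for event in events:
        heapq.heappush(heap, event)
        ts = event[0]
        # Seeing a timestamp of `ts` guarantees no future event is
        # earlier than ts - jitter, so anything queued at or below
        # that lower threshold is safe to emit.
        while heap and heap[0][0] <= ts - jitter:
            yield heapq.heappop(heap)
    # End of input: drain whatever is still queued.
    while heap:
        yield heapq.heappop(heap)
```

The heap plays the role of the priority queue, and the pair of
thresholds is implicit in the `ts - jitter` comparison: the queue
grows while events inside the window arrive, and drains as soon as
an event beyond the upper threshold proves the older ones complete.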
Such a routine can be done as a generator function, with the two
thresholds simply being local variables in the function. And the
generator can then be used in a merge function, just as any other
iterator producing a sorted list would be used.
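For the merge itself, the standard library already has what you need:
heapq.merge() accepts any number of sorted iterators (such as the
generator above, one per logfile) and lazily yields one combined
sorted stream. A toy illustration with two hand-made streams standing
in for the per-file generators:

```python
import heapq

# Two stand-in per-file event streams, each already in sorted order
# (in the real program each would be the de-jittering generator
# wrapped around one logfile).
stream_a = iter([(1, 'a1'), (4, 'a2'), (9, 'a3')])
stream_b = iter([(2, 'b1'), (3, 'b2'), (8, 'b3')])

# heapq.merge never loads a whole file into memory; it keeps just
# one pending item per input stream.
merged = list(heapq.merge(stream_a, stream_b))
# merged == [(1, 'a1'), (2, 'b1'), (3, 'b2'), (4, 'a2'),
#            (8, 'b3'), (9, 'a3')]
```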
And of course you would need to carefully test such a generator, both
with worst-case data and against the live data. If our assumption
about worst-case jitter is wrong, the thresholds might need changing.
Finally, consider the possibility that if you pick a fixed jitter size
that statistically nearly always works (like 1000 items), then you could
just continuously compare the output data as it's being generated. If
once every 30 days it's out of order, then do a massive sort on the
result file that day, and figure that the speedup on the other 29 days,
with simpler code, was worth it. In this case, the iterator is simply a
priority queue of fixed size.
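A sketch of that simpler variant, again with names of my own
choosing: a fixed-size heap that emits its smallest item for each new
arrival, plus a checker that notices the rare day the window was too
small so you know to re-sort that day's output offline:

```python
import heapq

def fixed_window(events, size=1000):
    """Yield events through a fixed-size priority queue.

    Correct only if no event is displaced by more than `size`
    positions in the input; pair it with a downstream check.
    """
    heap = []
    for event in events:
        if len(heap) >= size:
            # Push the new event, pop the smallest in one step.
            yield heapq.heappushpop(heap, event)
        else:
            heapq.heappush(heap, event)
    while heap:
        yield heapq.heappop(heap)

def check_sorted(stream):
    """Pass events through, flagging any out-of-order pair."""
    prev = None
    for event in stream:
        if prev is not None and event < prev:
            # The once-a-month case: log it and re-sort the day's
            # output file offline.
            raise ValueError("window too small: %r after %r" % (event, prev))
        prev = event
        yield event
```

With size=1000 the memory cost is constant and there is no threshold
bookkeeping at all; you are betting that 1000 items of lookahead
covers the jitter nearly every day.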