[Tutor] Logfile multiplexing

Dave Angel davea at ieee.org
Wed Nov 11 14:43:39 CET 2009


Stephen Nelson-Smith wrote:
> Hi,
>
> On Wed, Nov 11, 2009 at 10:05 AM, Alan Gauld <alan.gauld at btinternet.com> wrote:
>   
>> "Stephen Nelson-Smith" <sanelson at gmail.com> wrote
>>     
>>> I don't really want to admit defeat and have a cron job sort the logs
>>> before entry.  Anyone got any other ideas?
>>>       
>> Why would that be admitting defeat?
>>     
>
> Well, it means admitting defeat on solving the problem in Python.  Yes
> in practical terms, I should probably preprocess the data, but as a
> programming exercise, learning how to sort a number of files into one
> is something I'd like to crack.
>
> Maybe the real lesson here is knowing which battles to fight, and that a
> good developer uses the right tools for the job.
>
> S.
>
>   
That's certainly a valuable lesson.  On the other hand, I don't think 
you've run out of useful Python options.

The problem as I see it is that your input files are *mostly* sorted, 
and have an occasional out-of-order item.  And further that you're not 
sure how far out-of-order that can get;  you're not willing to pick a 
constant "10" and say that no item will have to be moved more than 10 
items.  Would you feel safe with 100?  Or 1000?

If you give it to "a cron job", the code will be very generalized, and 
perhaps not very efficient at sorting a list which is almost in order 
already.  So perhaps you can do better.  You might want to time it to be 
sure, or you might just decide that "good enough" doesn't need measuring.
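
If you do want to measure it, here's a hypothetical benchmark (mine, 
not from the thread): time a general-purpose sort on nearly-ordered 
input against the same sort on shuffled input, at something like your 
real file sizes.

import random
import timeit

N = 1000000

# Nearly-sorted input: swap a sprinkling of adjacent pairs.
nearly = list(range(N))
for i in random.sample(range(N - 1), 1000):
    nearly[i], nearly[i + 1] = nearly[i + 1], nearly[i]

# Fully shuffled input, for comparison.
shuffled = list(range(N))
random.shuffle(shuffled)

print(timeit.timeit(lambda: sorted(nearly), number=10))
print(timeit.timeit(lambda: sorted(shuffled), number=10))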

Back to the fundamental problem.  Apparently the max amount of jitter is 
one second.  So let's say you're processing all the 10:01 events, and 
you encounter a 10:02 one.  You don't know whether you've really got all 
the 10:02 ones till you see your first 10:03.  So you need a priority 
queue combined with a couple of threshold values, in this case 10:01 and 
10:03.  Fill the queue till you hit the upper threshold, letting the 
queue grow as needed.  So you have all the 10:01 events.  Now you feed 
out your queue into the merge logic, and don't replace items as they are 
removed, until you have removed the last one at your lower threshold.  
At this point raise the thresholds by one, and fill it again as before.

Such a routine can be done as a generator function, with the two 
thresholds simply being local variables in the function.  And the 
generator can then be used in a merge function, just as any other 
iterator producing a sorted list would be used.
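
Here's a minimal sketch of such a generator (my own illustration, not 
tested against your logs; the key function and whole-second timestamps 
are assumptions, since the thread doesn't show the real log format).  
heapq serves as the priority queue, `lower` is the lower threshold, 
and the upper threshold is simply lower + 2 for one second of jitter:

import heapq
import itertools

def jitter_sorted(events, key, jitter=1):
    # Yield events in timestamp order, assuming no event appears more
    # than `jitter` seconds later than any event before it in the
    # stream.  key(event) must return the timestamp in whole seconds.
    heap = []
    tiebreak = itertools.count()  # keeps equal timestamps in arrival order
    lower = None                  # the second currently being drained
    for event in events:
        t = key(event)
        if lower is None:
            lower = t
        # Seeing the upper threshold (lower + jitter + 1) means every
        # event at or below `lower` is already queued, so it's safe to
        # emit them; then raise both thresholds by one and re-check.
        while t >= lower + jitter + 1:
            while heap and heap[0][0] <= lower:
                yield heapq.heappop(heap)[2]
            lower += 1
        heapq.heappush(heap, (t, next(tiebreak), event))
    # Input exhausted: whatever remains is all we'll ever see.
    while heap:
        yield heapq.heappop(heap)[2]

One such generator per logfile plugs straight into heapq.merge() 
(Python 2.6 and later), or any other merge of sorted iterators.  Note 
that heapq.merge() compares the yielded items themselves, so for 
records that aren't directly comparable you'd yield (timestamp, 
record) pairs instead.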

And of course you would need to carefully test such a generator, both 
with worst-case data and against the live data.  If our assumption 
about worst-case jitter is wrong, the thresholds might need changing.

Finally, consider the possibility that if you pick a fixed jitter size 
that statistically nearly always works (like 1000 items), then you 
could just continuously check the output data as it's being generated.  
If once every 30 days it comes out out of order, then do a massive 
sort on the result file that day, and figure that the speedup on the 
other 29 days, with simpler code, was worth it.  In this case, the 
iterator is simply a priority queue of fixed size.
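
A sketch of that fixed-size variant (again mine, with the same assumed 
key function): a heap capped at `window` items, emitting the smallest 
once the cap is reached, and checking every emitted timestamp against 
the previous one so the rare failure is detected as it happens.

import heapq
import itertools

def fixed_window_sort(events, key, window=1000):
    # Correct only when no event is more than `window` positions out
    # of order -- the statistical bet described above.
    heap = []
    tiebreak = itertools.count()
    last = None
    for event in events:
        heapq.heappush(heap, (key(event), next(tiebreak), event))
        if len(heap) > window:
            t, _, smallest = heapq.heappop(heap)
            if last is not None and t < last:
                # The bet failed; the caller falls back to the big
                # sort of that day's output file.
                raise ValueError("window of %d was too small" % window)
            last = t
            yield smallest
    while heap:                       # drain the tail the same way
        t, _, smallest = heapq.heappop(heap)
        if last is not None and t < last:
            raise ValueError("window of %d was too small" % window)
        last = t
        yield smallest

The consumer wraps this in a try/except and re-sorts that day's result 
file when the ValueError fires.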

DaveA


