Hi,

I have a second implementation of the load function [1]; the previous one was too naive. The main changes compared with the previous one are:

1) It uses a decay function instead of a vector of samples.
2) The load stands for the percentage of CPU used, giving a global view of how much CPU is still unused.
3) The update is done every _LOAD_FREQ (1 second by default) without using a scheduler callback.
4) Many corner cases are fixed.

The performance impact on the default loop implemented in Python is approximately 3% when running a trivial program that has no application overhead [2]. In real applications, where the application itself adds at least some footprint, this impact should therefore be negligible.

As an example of how the load method can be used, the following code [3] runs the loop at different rates of coroutines per second, where each coroutine has a CPU cost of 0.02 seconds, giving a maximum expected throughput of 50 coroutines per second. In this example, each coroutine first asks for the load of the system before it starts consuming CPU; if the load is higher than 0.9, the coroutine returns without doing anything.

The following snippet shows the execution output:

Load reached for 10.0 coros/seq: 0.20594804872227002, abandoned 0/100
reseting load....
Load reached for 20.0 coros/seq: 0.40599215789994814, abandoned 0/200
reseting load....
Load reached for 40.0 coros/seq: 0.8055964270483202, abandoned 0/400
reseting load....
Load reached for 80.0 coros/seq: 0.9390106831339007, abandoned 450/800

As mentioned, the program runs at different levels of throughput, printing at the end of each one the load reached by the system and the number of coroutines abandoned versus the total. As can be seen, once the last test begins and the load of the system reaches 0.90, the program is able to reduce the pressure at the application level; in this case by simply doing nothing, but in other environments by running the proper fallback.
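
To make points 1) and 2) and the 0.9 check more concrete, here is a minimal sketch of a decay-based load estimate and of the shedding decision. It is not the code from [1] or [3]: the names DecayedLoad, expected_load and should_run, and the window time constant, are assumptions made only for illustration.

```python
import math


class DecayedLoad:
    """Hypothetical estimator: exponentially decayed fraction of busy time."""

    def __init__(self, period=1.0, window=60.0):
        # period mirrors the 1 second default update interval from the text;
        # window is an assumed time constant, not a value taken from [1].
        self.period = period
        self.decay = math.exp(-period / window)
        self.load = 0.0

    def update(self, busy_seconds):
        """Fold the busy time of one period into the running load."""
        sample = min(max(busy_seconds / self.period, 0.0), 1.0)
        self.load = self.load * self.decay + sample * (1.0 - self.decay)
        return self.load


def expected_load(coros_per_second, cpu_cost=0.02):
    """Steady-state load for a given rate, capped at 1.0 (50 coros/s here)."""
    return min(coros_per_second * cpu_cost, 1.0)


def should_run(load, threshold=0.9):
    """Shed the work when the load is higher than the threshold."""
    return load <= threshold


if __name__ == "__main__":
    for rate in (10, 20, 40, 80):
        tracker = DecayedLoad(window=5.0)
        for _ in range(100):
            tracker.update(expected_load(rate) * tracker.period)
        print(rate, round(tracker.load, 3), should_run(tracker.load))
```

Note that this sketch has no feedback from the shedding itself, so the 80 coroutines per second case simply saturates towards 1.0 instead of settling slightly above the threshold as in the real run.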
The load values in the output match the expected ones, bearing in mind that the maximum throughput would be 50 coroutines per second.

The current implementation tries to return the load taking into account the global CPU resources, so other processes competing for the same CPU used by the loop are accounted for. This gives the developer a reliable metric that can be used in environments where the CPU is shared with other processes.

What is missing?

1) Investigate how difficult it would be to implement this feature in libuv [4] to make it available in uvloop.
2) Give more information about the differences between this implementation and the toobusy one [5].
3) Test the performance impact in real applications.

I would like to get more feedback from you. Also, if you believe that this implementation has a chance of becoming part of the CPython repo, what would be the next steps I should take?

[1] https://github.com/pfreixes/cpython/commit/ac07fef5af51746c7311494f21b0f067c...
[2] https://gist.github.com/pfreixes/233fd8c6a6ec82f2cde4688a2976bf2d
[3] https://gist.github.com/pfreixes/fd26c36391b33056b7efd525e4690aef
[4] http://docs.libuv.org/en/v1.x/
[5] https://github.com/lloyd/node-toobusy

On Sun, Aug 13, 2017 at 12:54 PM, Pau Freixes <pfreixes@gmail.com> wrote:
It looks like your "load average" is computing something very different than the traditional Unix "load average". If I'm reading right, yours is a measure of what percentage of the time the loop spent sleeping waiting for I/O, taken over the last 60 ticks of a 1 second timer (so generally slightly longer than 60 seconds). The traditional Unix load average is an exponentially weighted moving average of the length of the run queue.
The proposed implementation aims to expose the load of the loop, providing a direct metric that comes from the loop itself instead of relying on an external metric such as CPU usage, load average or others.
Yes, the load average uses a decay function based on the length of the run queue of processes that are using or waiting for a CPU. Compared with the CPU load, this gives us extra information about how overloaded the system is.
In the case presented, the load of the loop is roughly equivalent to the load of the CPU, and it cannot tell you how overloaded your loop is once it has reached 100%.
Is one of those definitions better for your goal of detecting when to shed load? I don't know. But calling them the same thing is pretty confusing :-). The Unix version also has the nice property that it can actually go above 1; yours doesn't distinguish between a service whose load is at exactly 100% of capacity and barely keeping up, versus one that's at 200% of capacity and melting down. But for load shedding maybe you always want your tripwire to be below that anyway.
Well, I partially disagree with this. The notion of load has equivalents in computing among other metrics with a closed range, such as CPU usage. I never intended to align the load of the loop with the load average; I just used that concept as an example of a metric that can be used to check how loaded your system is.
More broadly we might ask what's the best possible metric for this purpose – how do we judge? A nice thing about the JavaScript library you mention is that scheduling delay is a real thing that directly impacts the quality of service – it's more of an "end to end" measure in a sense. Of course, if you really want an end to end measure you can do things like instrument your actual logic, see how fast you're replying to HTTP requests or whatever, which is even more valid but creates complications because some requests are supposed to take longer than others, etc. I don't know which design goals are important for real operations.
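
For comparison, a scheduling delay of this kind can be approximated in asyncio by checking how late a timer fires. The following is only a rough sketch of the idea, not the toobusy implementation; measure_lag and demo are hypothetical names, and the blocking time.sleep call stands in for a CPU-bound coroutine.

```python
import asyncio
import time


async def measure_lag(interval=0.01, samples=3):
    """Return the worst lateness, in seconds, of asyncio.sleep(interval)."""
    worst = 0.0
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        worst = max(worst, time.monotonic() - start - interval)
    return worst


async def demo():
    probe = asyncio.create_task(measure_lag())
    await asyncio.sleep(0)   # let the probe start its first sleep
    time.sleep(0.2)          # block the loop, as a CPU-bound coroutine would
    return await probe


if __name__ == "__main__":
    print("worst scheduling delay: %.3f s" % asyncio.run(demo()))
```

Unlike a load bounded between 0 and 1, a delay measured this way keeps growing as the loop falls further behind.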
This is the key point for me, and something I should have based my rationale on. How good is the presented way of measuring the load of an asynchronous system compared with the toobusy one? What can we achieve with this metric?
I will work on that as the basis of my rationale for the proposed change. Then, once the rationale is accepted, the implementation is peanuts :)
-- --pau