Re: [Async-sig] Feedback, loop.load() function

20 Aug 2017

Hi,

I have a second implementation of the load function [1]; the previous
one was too naive. The main changes compared with the previous one
are:

1) Uses a decay function instead of a vector of samples.
2) The load stands for the percentage of CPU used, giving a global
view of how many CPU resources are still unused.
3) The update is done every _LOAD_FREQ (1 second by default) without
using a scheduler callback.
4) Many corner cases are fixed.
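
To illustrate point 1, here is a minimal sketch of a decayed load
value. It is not the code in [1]: the class name DecayedLoad, the
decay constant and the update() method are made up for illustration,
and the busy fraction of each period is simply passed in rather than
measured. The point is only that a single running value replaces the
vector of samples.

```python
class DecayedLoad:
    """Exponentially decayed busy fraction (hypothetical helper, not from [1])."""

    def __init__(self, decay=0.5):
        self.decay = decay   # weight kept from the previous value (assumed constant)
        self.value = 0.0     # current load estimate, always within [0, 1]

    def update(self, busy):
        # busy: fraction of the last period the loop spent working, clamped to [0, 1].
        busy = min(max(busy, 0.0), 1.0)
        self.value = self.decay * self.value + (1.0 - self.decay) * busy
        return self.value


load = DecayedLoad()
# One call per update period (the role _LOAD_FREQ plays in the proposal).
for busy in (0.0, 0.5, 1.0, 1.0, 1.0):
    print(round(load.update(busy), 3))
```

With a decay of 0.5 the value moves halfway towards the latest sample
at every update, so old periods fade out without being stored.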

The performance impact on the default loop implemented in Python is
approx. 3% when running a trivial program that has no application
overhead [2]. Therefore, in real applications where the application
itself adds some footprint, this performance impact should be
negligible.

As an example of how the load method can be used, the following code
[3] runs the loop at different rates of coroutines per second, where
each coroutine has a CPU cost of 0.02 seconds, which gives an expected
maximum throughput of 50 coroutines per second. In this example, each
coroutine first asks for the load of the system before it starts
consuming CPU; if the load is higher than 0.9, the coroutine returns
without doing anything.
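
The pattern can be sketched as follows. This is not the code in [3]:
stock asyncio has no loop.load(), so the sketch takes a get_load
callable as a stand-in for it, and the load readings it is fed are
made up for illustration. The names worker, burn_cpu and stats are
hypothetical as well.

```python
import asyncio
import time

LOAD_THRESHOLD = 0.9   # same threshold as in the example
CPU_COST = 0.02        # seconds of CPU burned by each coroutine


def burn_cpu(seconds):
    # Busy-wait so the event loop is really blocked, as CPU-bound work would do.
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass


async def worker(get_load, stats):
    # get_load stands in for the proposed loop.load(); it is an assumption.
    if get_load() > LOAD_THRESHOLD:
        stats["abandoned"] += 1   # shed the work: return without touching the CPU
        return
    burn_cpu(CPU_COST)
    stats["done"] += 1


async def main():
    stats = {"done": 0, "abandoned": 0}
    fake_loads = iter([0.2, 0.95, 0.5])   # made-up readings, illustration only
    for _ in range(3):
        await worker(lambda: next(fake_loads), stats)
    return stats


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a real service the early return would be replaced by whatever
fallback makes sense, for example answering with an error status.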

The following snippet shows the execution output:

Load reached for 10.0 coros/seq: 0.20594804872227002, abandoned 0/100
reseting load....
Load reached for 20.0 coros/seq: 0.40599215789994814, abandoned 0/200
reseting load....
Load reached for 40.0 coros/seq: 0.8055964270483202, abandoned 0/400
reseting load....
Load reached for 80.0 coros/seq: 0.9390106831339007, abandoned 450/800

As said, the program runs at different levels of throughput and
prints, at the end of each one, the load the system reached and the
number of coroutines abandoned out of the total. As can be seen, once
the last test begins to run and the load of the system reaches 0.90,
the program is able to reduce the pressure at the application level;
in this case it simply returns without doing anything, but in other
environments it would run the proper fallback. The load values are
in line with the expected ones, bearing in mind that the maximum
throughput that can be reached is 50 coroutines per second.
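
The expected values are plain arithmetic: rate times CPU cost per
coroutine, capped at 1.0. A small sketch (the function name
expected_load is mine) gives ideal figures of 0.2, 0.4 and 0.8 for
the first three rates, close to the measured 0.206, 0.406 and 0.806,
and a saturated loop at 80 coroutines per second if nothing were
abandoned.

```python
CPU_COST = 0.02  # seconds of CPU per coroutine, as in the example


def expected_load(rate):
    """Ideal load for `rate` coroutines per second, capped at 1.0."""
    return min(rate * CPU_COST, 1.0)


for rate in (10, 20, 40, 80):
    print(rate, expected_load(rate))
```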

The current implementation tries to return the load taking into
account the global CPU resources, so other processes that compete for
the same CPU used by the loop are also reflected in the value. This
should give the developer a reliable metric that can be used in
environments where the CPU is shared with other processes.

What is missing? Investigating how difficult it would be to implement
this feature in libuv [4] to make it available in uvloop. Giving more
information about the differences between this implementation and the
toobusy one [5]. Testing the performance impact in real applications.
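
For comparison, a scheduling-delay measurement in the spirit of
toobusy can be sketched in asyncio as below. This is my own
approximation, not a port of [5]: the names measure_lag and blocker,
the interval and the number of samples are all assumptions. It
measures how late a sleep wakes up while another coroutine holds the
loop.

```python
import asyncio
import time


async def measure_lag(interval=0.05, samples=5):
    """Return how late each asyncio.sleep(interval) wake-up was, in seconds."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        lags.append(max(time.monotonic() - start - interval, 0.0))
    return lags


async def blocker(seconds):
    # Simulates a misbehaving coroutine that holds the loop without yielding.
    await asyncio.sleep(0.01)
    time.sleep(seconds)


async def main():
    lags, _ = await asyncio.gather(measure_lag(), blocker(0.2))
    return lags


if __name__ == "__main__":
    print(["%.3f" % lag for lag in asyncio.run(main())])
```

Unlike a load bounded at 1.0, the delay keeps growing the longer the
loop is held, which is one of the differences worth documenting.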

I would like to get more feedback from you. And, if you believe that
this implementation has some chance of becoming part of the CPython
repo, what would be the next steps that I should take?

[1] https://github.com/pfreixes/cpython/commit/ac07fef5af51746c7311494f21b0f067c...
[2] https://gist.github.com/pfreixes/233fd8c6a6ec82f2cde4688a2976bf2d
[3] https://gist.github.com/pfreixes/fd26c36391b33056b7efd525e4690aef
[4] http://docs.libuv.org/en/v1.x/
[5] https://github.com/lloyd/node-toobusy

On Sun, Aug 13, 2017 at 12:54 PM, Pau Freixes <pfreixes@gmail.com> wrote:
...
...
It looks like your "load average" is computing something very different than
the traditional Unix "load average". If I'm reading right, yours is a
measure of what percentage of the time the loop spent sleeping waiting for
I/O, taken over the last 60 ticks of a 1 second timer (so generally slightly
longer than 60 seconds). The traditional Unix load average is an
exponentially weighted moving average of the length of the run queue.
The implementation proposed wants to expose the load of the loop,
having a direct metric that comes from the loop instead of using an
external metric such as CPU, load average or others.
Yes, the load average uses a decay function based on the length of the
run queue for those processes that are using or waiting for a CPU;
compared with the CPU load, this gives us extra information about how
overloaded our system is.
In the case presented, the load of the loop is something equivalent
to the load of the CPU, and it does not have the ability to inform
you about how overloaded your loop is once it has reached 100%.
...
Is one of those definitions better for your goal of detecting when to shed
load? I don't know. But calling them the same thing is pretty confusing :-).
The Unix version also has the nice property that it can actually go above 1;
yours doesn't distinguish between a service whose load is at exactly 100% of
capacity and barely keeping up, versus one that's at 200% of capacity and
melting down. But for load shedding maybe you always want your tripwire to
be below that anyway.
Well, I partially disagree with this. The load definition has its
equivalent in computing in other metrics that have a closed range,
such as the CPU one. I never intended to align the load of the loop
with the load average; I just used the concept as an example of a
metric that might be used to check how loaded your system is.
...
More broadly we might ask what's the best possible metric for this purpose –
how do we judge? A nice thing about the JavaScript library you mention is
that scheduling delay is a real thing that directly impacts the quality of
service – it's more of an "end to end" measure in a sense. Of course, if you
really want an end to end measure you can do things like instrument your
actual logic, see how fast you're replying to HTTP requests or whatever,
which is even more valid but creates complications because some requests are
supposed to take longer than others, etc. I don't know which design goals
are important for real operations.
Here is the key for me, something on which I should have based my
rationale. How good is the presented way of measuring the load of
your asynchronous system compared with the toobusy one? What can we
achieve with this metric?
I will work on that as the basis of my rationale for the proposed
change. Then, once the rationale is accepted, the implementation
is peanuts :)
--
--pau