I have attached a screenshot that shows our existing Prometheus/Grafana dashboard for the Devpi Nginx instances. The interesting part is the spiky latency on both devpi replicas. From the outside it is hard to judge what is going on internally that might cause those latency variations. For example, additional metrics to detect when a replica is waiting for data from the master (or when the master is fetching data from a mirror) might help to pinpoint what is going on.
This is not a pressing issue for us right now. I am mostly sharing this so that you get an idea of what could be of interest to users like us.
Have a nice weekend!
Stephan
On 05.04.19, 12:01, "Florian Schulze" wrote:
On 5 Apr 2019, at 10:22, Stephan Erb wrote:
> Hi Florian,
>
> this is a great idea. We actually thought about implementing something
> like this but never had the chance yet.
>
> We are running Nginx in front of our Devpi instances and therefore
> already have sufficient Prometheus metrics coverage of HTTP requests,
> request latencies, etc. What would still be helpful:
>
> * The master serial, current serial, and processed event serial. This
> would allow us to easily alert on lagging replicas.
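If those serials were exposed as gauges, a minimal Prometheus alerting rule could look roughly like this (the metric names and threshold are assumptions, not anything devpi exposes today):

```yaml
groups:
  - name: devpi
    rules:
      - alert: DevpiReplicaLagging
        # devpi_master_serial / devpi_current_serial are placeholder names
        expr: devpi_master_serial - devpi_current_serial > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "devpi replica is lagging behind the master"
```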
> * The number of keyfs cache hits and cache misses, so that we know when
> to tune the keyfs-cache-size.
I thought of the above myself.
> * Some internal counters to figure out when and how often we are
> running into expired mirror caches.
Could you elaborate on this?
All the above would be possible solely from a plugin, so I think I will
go that route first.
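To make the idea concrete, here is a rough standard-library-only sketch of what such a plugin could expose on a /metrics endpoint in the Prometheus text format. The metric names follow the list above, but everything else (the update sites, the port) is a placeholder; a real plugin would update these values from devpi-server's replication and keyfs internals.

```python
# Hypothetical sketch: exposing the serials and keyfs cache counters
# discussed above in the Prometheus text exposition format.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder metric store; a real plugin would update these values
# from devpi-server's replication and keyfs code.
METRICS = {
    "devpi_master_serial": 0,
    "devpi_current_serial": 0,
    "devpi_processed_event_serial": 0,
    "devpi_keyfs_cache_hits_total": 0,
    "devpi_keyfs_cache_misses_total": 0,
}
_lock = threading.Lock()


def set_metric(name, value):
    """Set a metric value (thread-safe)."""
    with _lock:
        METRICS[name] = value


def render_metrics():
    """Render all metrics as 'name value' lines (Prometheus text format)."""
    with _lock:
        return "".join(f"{name} {value}\n" for name, value in METRICS.items())


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve for scraping (blocks), e.g.:
#   HTTPServer(("127.0.0.1", 9100), MetricsHandler).serve_forever()
```

Prometheus could then scrape that endpoint directly, and the replica-lag alerting mentioned above becomes a simple subtraction over the exported gauges.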
Regards,
Florian Schulze