Questions about high availability and replication
Hello,

My organization is exploring replacing our current, patched-together "wheelhouse" with a proper Devpi server mirroring PyPi and hosting our internal packages. Because this Devpi architecture will be accessed by hundreds of developers and over 100 build server workers, it needs to be highly available (able to handle a lot of traffic). So this leads me to several questions (five major questions and a few sub-questions, to be more accurate):

- Is anyone familiar with a threshold of diminishing returns when it comes to the "threads" server setting? At what point, if any, does increasing this number do no good? I assume that # of "threads" roughly equates to # of simultaneous connections that single server is capable of, perhaps minus one or two. Is this a fair assumption?

- How many "devpi-server" instances can access a single "serverdir?" Is it just one, and bad things will happen if a second "devpi-server" tries to access it (and, if I need multiple servers, I should use master-replica replication)? Or can multiple "devpi-server" instances access a single "serverdir" to serve as multiple hot masters?

- When employing master-replica replication, I'm wondering if the following configuration is sensible, or if there are better approaches: 1) One (or possibly more, if allowed) master, 2) Two or more replicas replicating off the master, 3) An HTTPS load balancer that sends all traffic to the replicas (not the master? or include the master in the load balancing?). In this case, what happens if someone uses the https://load-balanced-devpi-url/ URL to publish a new package? Does a replica it lands on just forward that up to the master? Or does the user get an error? Or do worse things happen?

- When employing master-replica replication, I noticed in the documentation this word of caution: "You probably want to make sure that before starting the replica and/or master that you also have the Web interface and search plugin installed. You cannot install it later." To be clear, I already have "devpi-web" installed (and a theme, too), but this note confuses (and slightly concerns) me, so I'd like to understand it better. Why would you be unable to install "devpi-web" after starting a master or replica? What would it break? Why wouldn't this work? Or do you just mean that you would have to shut down the servers to install "devpi-web" (which is understandable)? Is this different from non-replication environments (can you install "devpi-web" after starting a non-replicating server), and if so, why? I'm asking for details about this because I'm curious if it has potential consequences (stability, etc.) that extend beyond just the "devpi-web" plugin.

- When creating a new index, could someone elaborate a bit more on the "mirror_whitelist" setting? What is the difference between not setting it at all (default/implicit) and explicitly setting it to "*"? What does "*" actually mean? When you upload a new package to the index that conflicts with a package in its base index, is the resulting behavior that the uploaded package (or perhaps just the versions you upload) ALWAYS overrides the one in the base index, regardless of the "mirror_whitelist" setting? My goal here is to have an index that uses "root/pypi" as its base and also hosts our internal packages, so that we can point Pipenv to one URL and install everything from there. Is there a more sensible approach than that which I am planning on taking?

Thanks in advance for helping me make the right decisions in our upcoming deployment.

Nick Williams
On 30 Sep 2019, at 15:26, Nicholas Williams wrote:
Hello,
My organization is exploring replacing our current, patched-together "wheelhouse" with a proper Devpi server mirroring PyPi and hosting our internal packages. Because this Devpi architecture will be accessed by hundreds of developers and over 100 build server workers, it needs to be highly available (able to handle a lot of traffic). So this leads me to several questions (five major questions and a few sub-questions, to be more accurate):
The big question is, do you have mostly reads, or also many writes? The former can be handled in various ways, the latter is always limited by the "master" server and mostly just needs enough available resources (RAM and CPU).

I would set up devpi-server with the default settings, nginx in front and monitoring the performance before doing anything more.

With the default setup and the example configuration for nginx created with the --gen-config option all files are served by nginx. The only hits to devpi from pip then are for the release listings (/{user}/{index}/+simple/{project}), which are optimized by sniffing for the pip user-agent (and some others) and doing the least amount of work necessary. What you see in the browser has more work done than what pip sees.

Depending on the results of your performance monitoring it might also be beneficial to add a cache like Varnish between nginx and devpi-server to cache the simple pages for a short interval (a few seconds and serving old data while fetching new data) to remove even more requests from devpi-server if necessary. Everything else is mainly traffic by humans through the browser and depends on your use cases.
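As a rough sketch of that baseline (the serverdir path is made up, and the generated file name may differ between devpi-server versions):

    # generate example configs (nginx and friends) for this installation
    devpi-server --serverdir /srv/devpi --gen-config
    # review the generated nginx file, e.g. gen-config/nginx-devpi.conf,
    # and include it in your nginx setup

And if Varnish turns out to be needed later, a fragment along these lines would cache the simple pages briefly while serving stale data during refresh (an untested sketch, not tuned):

    sub vcl_backend_response {
        if (bereq.url ~ "\+simple/") {
            set beresp.ttl = 5s;     # cache simple pages for a few seconds
            set beresp.grace = 30s;  # keep serving old data while fetching new data
        }
    }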
- Is anyone familiar with a threshold of diminishing returns when it comes to the "threads" server setting? At what point, if any, does increasing this number do no good? I assume that # of "threads" roughly equates to # of simultaneous connections that single server is capable of, perhaps minus one or two. Is this a fair assumption?
This is mostly a question about RAM. With nginx in front serving the files, the amount of long requests to devpi-server itself shouldn't be that high and the default should be good. The GIL is a factor here, so good monitoring is essential to see if and how this affects performance.
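For reference, the setting in question is the --threads option of devpi-server; the value below is arbitrary, not a recommendation:

    # only worth raising if monitoring shows requests queueing behind busy threads
    devpi-server --serverdir /srv/devpi --threads 16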
- How many "devpi-server" instances can access a single "serverdir?" Is it just one, and bad things will happen if a second "devpi-server" tries to access it (and, if I need multiple servers, I should use master-replica replication)? Or can multiple "devpi-server" instances access a single "serverdir" to serve as multiple hot masters?
Only one, but there is the option to let more use the same directory as read-only instances with --requests-only, but nowadays I think the options mentioned above are better.
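Roughly, on the command line (paths, ports and the master URL are invented; check the flags against your devpi-server version):

    # exactly one full devpi-server process owns a given serverdir
    devpi-server --serverdir /srv/devpi --port 3141

    # older approach: extra request-serving processes sharing that serverdir
    devpi-server --serverdir /srv/devpi --port 3142 --requests-only

    # preferred approach: a replica with its own serverdir, pointed at the master
    devpi-server --serverdir /srv/devpi-replica --port 3141 --master-url http://devpi-master.example.internal:3141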
- When employing master-replica replication, I'm wondering if the following configuration is sensible, or if there are better approaches: 1) One (or possibly more, if allowed) master, 2) Two or more replicas replicating off the master, 3) An HTTPS load balancer that sends all traffic to the replicas (not the master? or include the master in the load balancing?). In this case, what happens if someone uses the https://load-balanced-devpi-url/ URL to publish a new package? Does a replica it lands on just forward that up to the master? Or does the user get an error? Or do worse things happen?
Replication is primarily meant for availability (geographic and duplication), not performance. All reads are handled locally on the replicas, all writes are proxied to the single master. If you do use it for load-balancing, then it doesn't matter where requests go, but I would direct all writes to the master instead of letting the replicas proxy it. Replicas are also useful for backups.
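To illustrate "direct all writes to the master", an nginx fragment along these lines could route by request method in front of the devpi instances (upstream addresses and names are invented; this is a sketch, not the config that --gen-config produces):

    upstream devpi_master   { server 10.0.0.10:3141; }
    upstream devpi_replicas { server 10.0.0.11:3141; server 10.0.0.12:3141; }

    map $request_method $devpi_backend {
        default devpi_master;     # POST/PUT/DELETE etc. (uploads, index changes) hit the master
        GET     devpi_replicas;   # reads are load-balanced across the replicas
        HEAD    devpi_replicas;
    }

    server {
        listen 443 ssl;           # certificates omitted in this sketch
        server_name devpi.example.internal;
        location / {
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_pass http://$devpi_backend;
        }
    }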
- When employing master-replica replication, I noticed in the documentation this word of caution:
"You probably want to make sure that before starting the replica and/or master that you also have the Web interface and search plugin installed. You cannot install it later."
To be clear, I already have "devpi-web" installed (and a theme, too), but this note confuses (and slightly concerns) me, so I'd like to understand it better. Why would you be unable to install "devpi-web" after starting a master or replica? What would it break? Why wouldn't this work? Or do you just mean that you would have to shut down the servers to install "devpi-web" (which is understandable)? Is this different from non-replication environments (can you install "devpi-web" after starting a non-replicating server), and if so, why? I'm asking for details about this because I'm curious if it has potential consequences (stability, etc.) that extend beyond just the "devpi-web" plugin.
It's because of the way search indexing currently works. The upcoming devpi-web 4.0.0 release will do away with all that. ETA is early October.
- When creating a new index, could someone elaborate a bit more on the "mirror_whitelist" setting? What is the difference between not setting it at all (default/implicit) and explicitly setting it to "*"? What does "*" actually mean? When you upload a new package to the index that conflicts with a package in its base index, is the resulting behavior that the uploaded package (or perhaps just the versions you upload) ALWAYS overrides the one in the base index, regardless of the "mirror_whitelist" setting? My goal here is to have an index that uses "root/pypi" as its base and also hosts our internal packages, so that we can point Pipenv to one URL and install everything from there. Is there a more sensible approach than that which I am planning on taking?
The default setting of an empty mirror_whitelist prevents locally uploaded packages from being *infected* by mirrors. If you have a package "myname" uploaded to your index, then a package "myname" from the mirror is blocked. This is what most people want, as that prevents a "myname 9999.0" from trumping your own "myname 2.1".

This gets in the way when you want to build binary wheels for packages which don't have one available. For that use case one can list all packages that are affected by that in "mirror_whitelist". In case this is a lot of work, the recommended way is to have two indexes. The first inherits from the mirror, is for the wheels and has "mirror_whitelist=*", so any uploaded wheel mixes with releases from the mirror. The second index inherits from the first, is for your private packages only and has a by default empty "mirror_whitelist". Installations only use that second index to always be safe.

Regards, Florian Schulze
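To make the two-index layout concrete, it could be created with devpi-client commands roughly like these (user and index names are made up, and you need to be logged in as the owning user first):

    # index 1: inherits from the mirror, takes your own wheels of public packages
    devpi index -c myorg/wheels bases=root/pypi mirror_whitelist='*'

    # index 2: inherits from index 1, holds only private packages,
    # mirror_whitelist left at its (empty) default
    devpi index -c myorg/prod bases=myorg/wheels

    # installers then point only at the second index, e.g.
    # https://devpi.example.internal/myorg/prod/+simple/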
This was super helpful. I have replied more inline:
On Sep 30, 2019, at 09:14, Florian Schulze <mail@florian-schulze.net> wrote:
On 30 Sep 2019, at 15:26, Nicholas Williams wrote:
Hello,
My organization is exploring replacing our current, patched-together "wheelhouse" with a proper Devpi server mirroring PyPi and hosting our internal packages. Because this Devpi architecture will be accessed by hundreds of developers and over 100 build server workers, it needs to be highly available (able to handle a lot of traffic). So this leads me to several questions (five major questions and a few sub-questions, to be more accurate):
The big question is, do you have mostly reads, or also many writes? The former can be handled in various ways, the latter is always limited by the "master" server and mostly just needs enough available resources (RAM and CPU).
Mostly reads. Potentially many hundreds of reads per second during the busiest time of day, but at most only two or three writes per second.
I would set up devpi-server with the default settings, nginx in front and monitoring the performance before doing anything more.
Yeah, I should have noted in my original email that I was planning on using Nginx in front of this. I already have it configured in my prototype. Do you have any kind of documentation on performance monitoring? I didn’t see that in my initial pass through the documentation, but I might have missed it. I would love more information on how to monitor performance as you suggest.
With the default setup and the example configuration for nginx created with the --gen-config option all files are served by nginx. The only hits to devpi from pip then are for the release listings (/{user}/{index}/+simple/{project}), which are optimized by sniffing for the pip user-agent (and some others) and doing the least amount of work necessary. What you see in the browser has more work done than what pip sees.
That is a great point. When I wrote this email, I forgot about the fact that files would be served directly instead of going through the Python process.
Depending on the results of your performance monitoring it might also be beneficial to add a cache like Varnish between nginx and devpi-server to cache the simple pages for a short interval (a few seconds and serving old data while fetching new data) to remove even more requests from devpi-server if necessary. Everything else is mainly traffic by humans through the browser and depends on your use cases.
- Is anyone familiar with a threshold of diminishing returns when it comes to the "threads" server setting? At what point, if any, does increasing this number do no good? I assume that # of "threads" roughly equates to # of simultaneous connections that single server is capable of, perhaps minus one or two. Is this a fair assumption?
This is mostly a question about RAM. With nginx in front serving the files, the amount of long requests to devpi-server itself shouldn't be that high and the default should be good. The GIL is a factor here, so good monitoring is essential to see if and how this affects performance.
Good point about the GIL. I am thinking of initially setting this value to 16. Does that sound sensible? It is what I would set my Django processes to, but that doesn’t necessarily mean it is appropriate for Devpi.
- How many "devpi-server" instances can access a single "serverdir?" Is it just one, and bad things will happen if a second "devpi-server" tries to access it (and, if I need multiple servers, I should use master-replica replication)? Or can multiple "devpi-server" instances access a single "serverdir" to serve as multiple hot masters?
Only one, but there is the option to let more use the same directory as read-only instances with --requests-only, but nowadays I think the options mentioned above are better.
Understood.
- When employing master-replica replication, I'm wondering if the following configuration is sensible, or if there are better approaches: 1) One (or possibly more, if allowed) master, 2) Two or more replicas replicating off the master, 3) An HTTPS load balancer that sends all traffic to the replicas (not the master? or include the master in the load balancing?). In this case, what happens if someone uses the https://load-balanced-devpi-url/ URL to publish a new package? Does a replica it lands on just forward that up to the master? Or does the user get an error? Or do worse things happen?
Replication is primarily meant for availability (geographic and duplication), not performance. All reads are handled locally on the replicas, all writes are proxied to the single master. If you do use it for load-balancing, then it doesn't matter where requests go, but I would direct all writes to the master instead of letting the replicas proxy it. Replicas are also useful for backups.
Perfect. To be clear, I was already planning on using replication for geographical availability at a later point in this process, after we get the initial single-location system working. So I thought it might be useful for load balancing in the single location, too.
- When employing master-replica replication, I noticed in the documentation this word of caution:
"You probably want to make sure that before starting the replica and/or master that you also have the Web interface and search plugin installed. You cannot install it later."
To be clear, I already have "devpi-web" installed (and a theme, too), but this note confuses (and slightly concerns) me, so I'd like to understand it better. Why would you be unable to install "devpi-web" after starting a master or replica? What would it break? Why wouldn't this work? Or do you just mean that you would have to shut down the servers to install "devpi-web" (which is understandable)? Is this different from non-replication environments (can you install "devpi-web" after starting a non-replicating server), and if so, why? I'm asking for details about this because I'm curious if it has potential consequences (stability, etc.) that extend beyond just the "devpi-web" plugin.
It's because of the way search indexing currently works. The upcoming devpi-web 4.0.0 release will do away with all that. ETA is early October.
Good to know.
- When creating a new index, could someone elaborate a bit more on the "mirror_whitelist" setting? What is the difference between not setting it at all (default/implicit) and explicitly setting it to "*"? What does "*" actually mean? When you upload a new package to the index that conflicts with a package in its base index, is the resulting behavior that the uploaded package (or perhaps just the versions you upload) ALWAYS overrides the one in the base index, regardless of the "mirror_whitelist" setting? My goal here is to have an index that uses "root/pypi" as its base and also hosts our internal packages, so that we can point Pipenv to one URL and install everything from there. Is there a more sensible approach than that which I am planning on taking?
The default setting of an empty mirror_whitelist prevents locally uploaded packages from being *infected* by mirrors. If you have a package "myname" uploaded to your index, then a package "myname" from the mirror is blocked. This is what most people want, as that prevents a "myname 9999.0" from trumping your own "myname 2.1".
This gets in the way when you want to build binary wheels for packages which don't have one available. For that use case one can list all packages that are affected by that in "mirror_whitelist". In case this is a lot of work, the recommended way is to have two indexes. The first inherits from the mirror, is for the wheels and has "mirror_whitelist=*", so any uploaded wheel mixes with releases from the mirror. The second index inherits from the first, is for your private packages only and has a by default empty "mirror_whitelist". Installations only use that second index to always be safe.
Oh, excellent! That is exactly what I needed to know. Now I know that I need to make some changes and have two indexes like you suggest. If I may make a suggestion, the documentation could use a section on this setting to explain exactly what you have just explained (unless I am completely blind and that is already there).
Regards, Florian Schulze
Thanks! Nick
On 30 Sep 2019, at 16:52, Nicholas Williams wrote:
This was super helpful. I have replied more inline:
On Sep 30, 2019, at 09:14, Florian Schulze <mail@florian-schulze.net> wrote:
On 30 Sep 2019, at 15:26, Nicholas Williams wrote:
Hello,
My organization is exploring replacing our current, patched-together "wheelhouse" with a proper Devpi server mirroring PyPi and hosting our internal packages. Because this Devpi architecture will be accessed by hundreds of developers and over 100 build server workers, it needs to be highly available (able to handle a lot of traffic). So this leads me to several questions (five major questions and a few sub-questions, to be more accurate):
The big question is, do you have mostly reads, or also many writes? The former can be handled in various ways, the latter is always limited by the "master" server and mostly just needs enough available resources (RAM and CPU).
Mostly reads. Potentially many hundreds of reads per second during the busiest time of day, but at most only two or three writes per second.
Then try my suggestions for the simpler setup first.
I would set up devpi-server with the default settings, nginx in front and monitoring the performance before doing anything more.
Yeah, I should have noted in my original email that I was planning on using Nginx in front of this. I already have it configured in my prototype. Do you have any kind of documentation on performance monitoring? I didn’t see that in my initial pass through the documentation, but I might have missed it. I would love more information on how to monitor performance as you suggest.
Standard stuff like nginx/Varnish request times, backend (devpi-server) fetches, etc.; depending on the tool you use, there should be plenty of documentation. For devpi-server there isn't something specific yet. I hope to do a Prometheus output soon, which will require new features from the upcoming devpi-server 5.2.0. With that you can monitor the internal devpi-server specific cache.
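As a starting point on the nginx side, something like this records per-request and upstream (devpi-server) timings; the format name and log path are arbitrary:

    log_format devpi_timing '$remote_addr "$request" $status '
                            'req=$request_time upstream=$upstream_response_time';
    access_log /var/log/nginx/devpi-timing.log devpi_timing;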
With the default setup and the example configuration for nginx created with the --gen-config option all files are served by nginx. The only hits to devpi from pip then are for the release listings (/{user}/{index}/+simple/{project}), which are optimized by sniffing for the pip user-agent (and some others) and doing the least amount of work necessary. What you see in the browser has more work done than what pip sees.
That is a great point. When I wrote this email, I forgot about the fact that files would be served directly instead of going through the Python process.
Depending on the results of your performance monitoring it might also be beneficial to add a cache like Varnish between nginx and devpi-server to cache the simple pages for a short interval (a few seconds and serving old data while fetching new data) to remove even more requests from devpi-server if necessary. Everything else is mainly traffic by humans through the browser and depends on your use cases.
- Is anyone familiar with a threshold of diminishing returns when it comes to the "threads" server setting? At what point, if any, does increasing this number do no good? I assume that # of "threads" roughly equates to # of simultaneous connections that single server is capable of, perhaps minus one or two. Is this a fair assumption?
This is mostly a question about RAM. With nginx in front serving the files, the amount of long requests to devpi-server itself shouldn't be that high and the default should be good. The GIL is a factor here, so good monitoring is essential to see if and how this affects performance.
Good point about the GIL. I am thinking of initially setting this value to 16. Does that sound sensible? It is what I would set my Django processes to, but that doesn’t necessarily mean it is appropriate for Devpi.
I would leave it at the default at first. It is the default from waitress, the WSGI server we use.
- How many "devpi-server" instances can access a single "serverdir?" Is it just one, and bad things will happen if a second "devpi-server" tries to access it (and, if I need multiple servers, I should use master-replica replication)? Or can multiple "devpi-server" instances access a single "serverdir" to serve as multiple hot masters?
Only one, but there is the option to let more use the same directory as read-only instances with --requests-only, but nowadays I think the options mentioned above are better.
Understood.
- When employing master-replica replication, I'm wondering if the following configuration is sensible, or if there are better approaches: 1) One (or possibly more, if allowed) master, 2) Two or more replicas replicating off the master, 3) An HTTPS load balancer that sends all traffic to the replicas (not the master? or include the master in the load balancing?). In this case, what happens if someone uses the https://load-balanced-devpi-url/ URL to publish a new package? Does a replica it lands on just forward that up to the master? Or does the user get an error? Or do worse things happen?
Replication is primarily meant for availability (geographic and duplication), not performance. All reads are handled locally on the replicas, all writes are proxied to the single master. If you do use it for load-balancing, then it doesn't matter where requests go, but I would direct all writes to the master instead of letting the replicas proxy it. Replicas are also useful for backups.
Perfect. To be clear, I was already planning on using replication for geographical availability at a later point in this process, after we get the initial single-location system working. So I thought it might be useful for load balancing in the single location, too.
It might be, but I would measure first before throwing more servers at it.
- When employing master-replica replication, I noticed in the documentation this word of caution:
"You probably want to make sure that before starting the replica and/or master that you also have the Web interface and search plugin installed. You cannot install it later."
To be clear, I already have "devpi-web" installed (and a theme, too), but this note confuses (and slightly concerns) me, so I'd like to understand it better. Why would you be unable to install "devpi-web" after starting a master or replica? What would it break? Why wouldn't this work? Or do you just mean that you would have to shut down the servers to install "devpi-web" (which is understandable)? Is this different from non-replication environments (can you install "devpi-web" after starting a non-replicating server), and if so, why? I'm asking for details about this because I'm curious if it has potential consequences (stability, etc.) that extend beyond just the "devpi-web" plugin.
It's because of the way search indexing currently works. The upcoming devpi-web 4.0.0 release will do away with all that. ETA is early October.
Good to know.
- When creating a new index, could someone elaborate a bit more on the "mirror_whitelist" setting? What is the difference between not setting it at all (default/implicit) and explicitly setting it to "*"? What does "*" actually mean? When you upload a new package to the index that conflicts with a package in its base index, is the resulting behavior that the uploaded package (or perhaps just the versions you upload) ALWAYS overrides the one in the base index, regardless of the "mirror_whitelist" setting? My goal here is to have an index that uses "root/pypi" as its base and also hosts our internal packages, so that we can point Pipenv to one URL and install everything from there. Is there a more sensible approach than that which I am planning on taking?
The default setting of an empty mirror_whitelist prevents locally uploaded packages from being *infected* by mirrors. If you have a package "myname" uploaded to your index, then a package "myname" from the mirror is blocked. This is what most people want, as that prevents a "myname 9999.0" from trumping your own "myname 2.1".
This gets in the way when you want to build binary wheels for packages which don't have one available. For that use case one can list all packages that are affected by that in "mirror_whitelist". In case this is a lot of work, the recommended way is to have two indexes. The first inherits from the mirror, is for the wheels and has "mirror_whitelist=*", so any uploaded wheel mixes with releases from the mirror. The second index inherits from the first, is for your private packages only and has a by default empty "mirror_whitelist". Installations only use that second index to always be safe.
Oh, excellent! That is exactly what I needed to know. Now I know that I need to make some changes and have two indexes like you suggest. If I may make a suggestion, the documentation could use a section on this setting to explain exactly what you have just explained (unless I am completely blind and that is already there).
The documentation could use some love in general. So far we didn't have anyone who sponsored it. At https://merlinux.eu/ you find contact information in case you want a support contract.

Glad I could help!

Regards, Florian Schulze
One more question. Based on your advice re: mirror_whitelist, I re-did my prototype and set it up as follows:

- "my_org/upstream" extends "root/pypi" with mirror_whitelist='*' ... to there, we uploaded custom wheels we built of third-party packages that are in PyPi but don't have wheels in PyPi. This worked as expected: I can see both the "root/pypi" versions and our "my_org/upstream" versions in this index.

- "my_org/stable" extends "my_org/upstream" with mirror_whitelist=<unset> ... to there, we uploaded our internal packages only. This is where it's not behaving as you described. For one internal package whose name conflicts with a package on PyPi, both the "root/pypi" packages and the "my_org/stable" packages are listed.

Thoughts? Did I do something wrong?

Thanks, Nick
On 1 Oct 2019, at 4:47, Nicholas Williams wrote:
One more question. Based on your advice re: mirror_whitelist, I re-did my prototype and set it up as follows:
- "my_org/upstream" extends "root/pypi" with mirror_whitelist='*' ... to there, we uploaded custom wheels we built of third-party packages that are in PyPi but don't have wheels in PyPi. This worked as expected: I can see both the "root/pypi" versions and our "my_org/upstream" versions in this index. - "my_org/stable" extends "my_org/upstream" with mirror_whitelist=<unset> ... to there, we uploaded our internal packages only. This is where it's not behaving as you described. In one internal package we have that conflicts with the name of a package on PyPi, both the "root/pypi" packages and the "my_org/stable" packages are listed.
Thoughts? Did I do something wrong?
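For reference, the effective settings of each index can be double-checked with devpi-client, e.g. (server URL is made up):

    devpi use https://devpi.example.internal/
    devpi index my_org/upstream   # should print bases, mirror_whitelist, etc.
    devpi index my_org/stable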
https://github.com/devpi/devpi/issues/725

Regards, Florian Schulze
participants (2)
- Florian Schulze
- Nicholas Williams