tl;dr: A quick solution would be to change root/pypi from a "mirror" to a "cache" which doesn't store anything in the DB and isn't replicated. The harder solution would be to change the replication protocol and the DB storage.
PyPI is growing quickly and that growth continues to affect the current way devpi handles root/pypi. We already had to make changes in the past due to the number of projects on PyPI.
First an overview of my understanding of why root/pypi works the way it does in devpi:
In the past PyPI was unreliable. It was down or slow quite often, and that caused issues in day-to-day use. Several tools started adding mitigations for that. For example, zc.buildout was one of the first tools that started caching data from PyPI locally to reduce download time and make repeated installations quicker.
Because devpi already provided the necessary data for pip and other tools, it made sense to cache PyPI data locally.
At some point replication was added to devpi. With regard to root/pypi, the thinking was that all data the master already had from PyPI should be copied to the replicas. That way all instances provide the same view, and once an installation with pip has worked, it continues to work regardless of which replica is used.
Now to the issues.
For replicas to be able to replicate all data easily and reliably without having to download all data when out of sync, we use serialized changesets. For the list of names from PyPI we stored that list in the DB. The problem was that the list was stored in full each time it changed. Because new projects are added all the time, that list took up quite a bit of space in the DB. We then changed that to keep the list only in RAM.
Now the next issue is the growing number of releases per project. We store the information from the links on the simple page of each project that was installed in the past. Whenever a project is accessed again, that data is updated if there are new releases. For a busy devpi instance that data can grow quite large.
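To illustrate why storing the full list on every change was costly, here is a rough sketch that counts how many names get written in total. The function names and the one-new-project-per-change model are assumptions made for illustration; this is not devpi code.

```python
def full_list_cost(initial, additions):
    """Names written if the complete list is stored again on every change."""
    total = initial          # the first full copy of the list
    size = initial
    for _ in range(additions):
        size += 1            # one new project appears on PyPI
        total += size        # the whole list is written again
    return total


def delta_cost(initial, additions):
    """Names written if only the newly added name is stored per change."""
    return initial + additions
```

With full copies the total grows roughly quadratically with the number of changes, while storing only the differences grows linearly.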
We also store all accessed release files.
With devpi-web we added indexing for search. Here the number of projects starts to become an issue as well. A new devpi instance downloads the names of all projects on PyPI and indexes them. As of this writing, that is ~150000 names. Writing that index on the first commit takes several minutes with the Whoosh backend we currently have.
Storage also becomes an issue. Long-running devpi instances have a constantly growing database and pile of package files. Currently there is no official way to clean that up, and the workarounds can cause unforeseen issues.
Now the question is where to go from here.
One idea is adding a "cache" index type, which doesn't replicate and either doesn't index at all, or only indexes currently cached data. Such an index would change behaviour of replicas, because each replica would have a different state of cached data.
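A minimal sketch of what "each replica would have a different state" means in practice. The class `CacheIndex` and the `fetch` callable are hypothetical names chosen for this illustration, not existing devpi APIs; the cache lives only in local memory and nothing is written to a replicated store.

```python
class CacheIndex:
    """Hypothetical non-replicated cache of simple page links."""

    def __init__(self, fetch):
        self._fetch = fetch      # assumed callable: project name -> list of links
        self._cached = {}        # local only, never part of any changeset

    def get_links(self, project):
        # Only ask upstream when this instance has not seen the project yet.
        if project not in self._cached:
            self._cached[project] = self._fetch(project)
        return self._cached[project]

    def cached_projects(self):
        # This is all a search index could cover for such an index type.
        return sorted(self._cached)
```

Two instances that serve different clients end up with different `cached_projects()` results, which is exactly the change in behaviour described above.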
Another solution would involve changing the replication protocol or at least some assumptions about it.
Currently replicas walk through all state changes of the master step by step and redo everything that happened to get to the current state. This is pretty wasteful most of the time. A new replica should get to the current state as quickly as possible. The full metadata of the current state isn't big. The biggest part is the release files. A new replica could get the full data for the current serial and then follow the individual changes from there. Fetching of the release files would work the same as it does now, except the initial list would be much bigger.
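The difference between the two approaches can be sketched with a toy model. `Master`, `Replica`, `snapshot` and `bootstrap` are names invented for this example and do not reflect the actual devpi replication code; deletions and release files are left out to keep it short.

```python
class Master:
    def __init__(self):
        self.changelog = []      # changelog[serial] holds the changes of that serial
        self.state = {}          # metadata as of the current serial

    @property
    def current_serial(self):
        return len(self.changelog) - 1

    def commit(self, changes):
        self.changelog.append(dict(changes))
        self.state.update(changes)
        return self.current_serial

    def snapshot(self):
        # Full metadata of the current serial in one go.
        return self.current_serial, dict(self.state)


class Replica:
    def __init__(self):
        self.serial = -1
        self.state = {}

    def bootstrap(self, master):
        # Proposed: jump straight to the master's current state.
        self.serial, self.state = master.snapshot()

    def follow(self, master):
        # Current behaviour: apply every missing serial one by one.
        for serial in range(self.serial + 1, master.current_serial + 1):
            self.state.update(master.changelog[serial])
            self.serial = serial
```

A fresh replica that only calls `follow` replays the whole history, while one that calls `bootstrap` first needs `follow` only for serials committed afterwards. Both end up with the same state.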
Once that works, we could start removing data from old serials. This can be a triggered operation like vacuum on databases and at a later point it might be possible to automate it. The master already keeps track of connected replicas. So it's pretty easy to check what the oldest synced serial of the replicas is and remove older ones. New replicas would get the full data of the current serial in one go. If there are replicas which have been out of sync for a longer time, they would also get a full set of metadata but can keep the release files they already have and update them.
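A sketch of how such a cleanup could pick its cutoff. The function names and the dict-based changelog are assumptions for illustration only. Keeping just the current serial when no replica is connected is also an assumption of this sketch.

```python
def oldest_needed_serial(replica_serials, current_serial):
    """replica_serials maps a connected replica's id to its last synced serial."""
    if not replica_serials:
        # Assumption: without connected replicas only the current serial is kept.
        return current_serial
    return min(replica_serials.values())


def prune(changelog, replica_serials):
    """Return a changelog (serial -> changes) without serials nobody needs."""
    keep_from = oldest_needed_serial(replica_serials, max(changelog))
    return {s: c for s, c in changelog.items() if s >= keep_from}


def needs_full_resync(replica_serial, changelog):
    """True if the next serial the replica needs has already been removed."""
    return replica_serial + 1 < min(changelog)
```

A replica that was disconnected during the cleanup and is too far behind would hit `needs_full_resync` and fetch the full metadata of the current serial, keeping the release files it already has.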
We could and most likely should limit these cleanups to the mirror indexes and maybe deleted indexes.
Both solutions would solve the storage issue in different ways. The biggest problem would still be the indexing, which I will cover in another mail.
Regards, Florian Schulze