Hi!
tldr: Quick solution would be changing root/pypi from a "mirror" to a
"cache" which doesn't store anything in the DB and isn't replicated. The
harder solution would be changes to the replication protocol and DB
storage.
PyPI is growing quickly and that growth continues to affect the current
way devpi handles root/pypi. We already had to make changes in the past
due to the number of projects on PyPI.
First an overview of my understanding of why root/pypi works the way it
does in devpi:
In the past PyPI was unreliable. It was down or slow quite often and
that caused issues in day to day use. Several tools started adding
mitigations for that. For example zc.buildout was one of the first tools
that started caching data from PyPI locally to reduce download time and
make repeated installations quicker.
Because devpi already provided the necessary data for pip and other
tools, it made sense to cache PyPI data locally.
At some point replication was added to devpi. In regard to root/pypi,
the thinking was, that all data the master already had from PyPI should
be copied to the replicas, so all instances could provide the same view
and once an installation with pip has worked, it should continue to
work, regardless of which replica was used.
Now to the issues.
For replicas to be able to replicate all data easily and reliably
without having to download all data when out of sync, we use serialized
changesets. For the list of names from PyPI we stored that list in the
DB. Problem was that the list was stored in full each time it changed.
Because new projects are added all the time, that list took up quite a
bit of space in the DB. We then changed it by only keeping the list in
RAM.
Now the next issue is the growing number of releases per project. We
store the infos from the links on the simple page of each project that
was installed in the past. Whenever it is accessed again, it is updated
when there are new releases. For a busy devpi instance that data can
grow quite large.
We also store all accessed release files.
With devpi-web we added indexing for search. Here the number of projects
starts to become an issue as well. A new devpi instance downloads the
names of all projects on PyPI and indexes them. As of this writing,
these are ~150000 names. Writing that index on the first commit takes
several minutes with the Whoosh backend we currently have.
Storage also becomes an issue. Long running devpi instances have a
constantly growing database and pile of package files. And currently
there is no official way to clean that up and the workarounds can cause
unforseen issues.
Now the question is where to go from here.
One idea is adding a "cache" index type, which doesn't replicate and
either doesn't index at all, or only indexes currently cached data. Such
an index would change behaviour of replicas, because each replica would
have a different state of cached data.
Another solution would involve changing the replication protocol or at
least some assumptions about it.
Currently replicas walk through all state changes of the master step by
step and redo everything that happend to get to the current state. This
is pretty wasteful most of the time. A new replica should get to the
current state as quickly as possible. The full metadata of the current
state isn't big. The biggest part are the release files. A new replica
could get the full data for the current serial and then follow the
individual changes from there. Fetching of the release files would work
the same as it does now, except the initial list would be much bigger.
Once that works, we could start removing data from old serials. This can
be a triggered operation like vacuum on databases and at a later point
it might be possible to automate it. The master already keeps track of
connected replicas. So it's pretty easy to check what the oldest synced
serial of the replicas is and remove older ones. New replicas would get
the full data of the current serial in one go. If there are replicas
which have been out of sync for a longer time, they would also get a
full set of metadata but can keep the release files they already have
and update them.
We could and most likely should limit these cleanups to the mirror
indexes and maybe deleted indexes.
Both solutions would solve the storage issue in different ways. The
biggest problem would still be the indexing. Which will be in another
mail.
Regards,
Florian Schulze