[devpi-dev] Re: Thoughts on root/pypi and its implications

24 Jul 2018

Hi Florian,

It is great that you are investing in keeping Devpi up to speed with increasing data sizes. However, Stephan Erb and I are not yet convinced that introducing a new cache index type is the best approach to this problem. Our main reason is that we expect it to significantly increase code complexity, while the other approaches you mention would solve the issue as well and seem like they could be built mostly from functionality already in the code base.

We also see issues with our current set-up if `root/pypi` were changed to a caching index. We specifically use Devpi to be independent of our outside connectivity, and we load balance between replicas. In this set-up, allowing replicas to get into an inconsistent state will likely cause hard-to-track-down issues in case of connectivity loss.

In addition, we also use Devpi-to-Devpi mirroring in our set-up. Thus, any approach that improves the mirroring mechanism will benefit us there as well. In contrast, moving `root/pypi` to a different mechanism adds the danger of the mirroring code becoming less battle-tested and thus less reliable.

We would much prefer the changes to the replication mechanism you mentioned. Especially compaction sounds like a very useful feature. It would not only speed up recovery of replicas, but having a smaller DB state should also be beneficial to the operation of Devpi itself. In addition, as we semi-regularly prune old `.devX` releases from our instances to reduce backup sizes, this would also allow us to profit from this with our other mirror indices. Because of this, for us it would even be interesting to compact non-mirror indices. But, as the critical systems only get those packages via mirror indices, this is of less importance.
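To illustrate why we expect a compacted state to speed up replica recovery, here is a minimal sketch. It is not Devpi code: the `Change` tuple, `apply_changes` and `bootstrap_from_snapshot` are hypothetical names, and the changelog is modelled as simple key/value changes where a value of `None` means a deletion. Under these assumptions, a replica that starts from a snapshot and replays only the later serials ends up in the same state as one that replays everything.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: (serial, key, value); value None means "key deleted".
Change = Tuple[int, str, Optional[str]]
State = Dict[str, str]


def apply_changes(state: State, changes: List[Change]) -> State:
    """Return a new state with the changes applied in serial order."""
    result = dict(state)
    for _serial, key, value in sorted(changes, key=lambda c: c[0]):
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result


def replay_from_start(changelog: List[Change]) -> State:
    """What a replica does when it walks through every serial."""
    return apply_changes({}, changelog)


def bootstrap_from_snapshot(
    snapshot: State, snapshot_serial: int, changelog: List[Change]
) -> State:
    """Start from a full state and only replay the serials after it."""
    later = [c for c in changelog if c[0] > snapshot_serial]
    return apply_changes(snapshot, later)


changelog: List[Change] = [
    (1, "pkg-a/1.0.dev1", "file-a1"),
    (2, "pkg-b/2.0", "file-b"),
    (3, "pkg-a/1.0.dev1", None),  # pruned .devX release
    (4, "pkg-a/1.0", "file-a2"),
    (5, "pkg-c/0.1", "file-c"),
]

snapshot_serial = 4
snapshot = apply_changes({}, [c for c in changelog if c[0] <= snapshot_serial])

full = replay_from_start(changelog)
fast = bootstrap_from_snapshot(snapshot, snapshot_serial, changelog)
```

In this toy example the snapshot no longer contains the pruned `.devX` release at all, so a new replica never has to process its addition and removal.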

As we see it, a lot of the building blocks required for the compaction mechanism already exist in the code base. For example, we could imagine the state representation reusing parts of the export mechanism. Thus, complexity is less of an issue here. We even expect this to help with a class of errors we ran into in the past, where the code stumbled when distributions had been added and removed before the replica was connected, as those distributions would no longer be present in the compacted state. The only real caveat we see is making sure that a replica that has been disconnected for a while does not block compaction, but you have already hinted at this case in your mail, so we don't expect this to become an issue.
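A rough sketch of what we have in mind, again with purely hypothetical names (`compaction_serial`, `compact`, `max_lag`) and a simplified key/value changelog rather than Devpi's actual storage format: the master picks the oldest serial among the replicas that are not too far behind, folds everything up to that serial into a single state, and keeps only the later changes. Replicas that lag more than `max_lag` serials are assumed to need a full resync and therefore do not hold compaction back.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: (serial, key, value); value None means "key deleted".
Change = Tuple[int, str, Optional[str]]


def compaction_serial(
    replica_serials: Dict[str, int], master_serial: int, max_lag: int
) -> int:
    """Pick the serial up to which history may be folded.

    Replicas more than max_lag serials behind the master are ignored,
    so a long-disconnected replica cannot block compaction.
    """
    active = [
        serial
        for serial in replica_serials.values()
        if master_serial - serial <= max_lag
    ]
    return min(active, default=master_serial)


def compact(
    changelog: List[Change], up_to: int
) -> Tuple[Dict[str, str], List[Change]]:
    """Fold all changes with serial <= up_to into one state; keep the rest."""
    state: Dict[str, str] = {}
    tail: List[Change] = []
    for change in sorted(changelog, key=lambda c: c[0]):
        serial, key, value = change
        if serial > up_to:
            tail.append(change)
        elif value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state, tail


changelog: List[Change] = [
    (1, "pkg-a/1.0.dev1", "file-a1"),
    (2, "pkg-b/2.0", "file-b"),
    (3, "pkg-a/1.0.dev1", None),  # added and removed again
    (4, "pkg-a/1.0", "file-a2"),
    (5, "pkg-c/0.1", "file-c"),
    (6, "pkg-b/2.0", None),
]
replica_serials = {"replica-1": 5, "replica-2": 4, "replica-old": 1}

cutoff = compaction_serial(replica_serials, master_serial=6, max_lag=3)
state, tail = compact(changelog, cutoff)
```

Here `replica-old` is ignored, the cutoff becomes serial 4, and the distribution that was added and removed before the cutoff does not appear in the compacted state at all.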

In case this simplifies the implementation, it would be fully feasible for us to run the compaction only offline. We already take our masters offline each night to perform a filesystem snapshot for backup purposes. Thus, doing the same to perform compaction would not create an operational problem for us. It would probably even give us better control over, and monitoring of, the compaction process compared to some on-the-fly implementation.

So, tl;dr: while adding a cache index type might seem like the quicker solution, code complexity probably levels out the implementation cost of both approaches, and we see significant additional benefits in the compaction approach.

Kind Regards,
Matthias

--

Dr. Matthias Bach
Senior Software Engineer

matthias.bach@blue-yonder.com
T    +49 721 383117 6244

Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe

blue-yonder.com <http://www.blue-yonder.com/>
@BlueYonderTech <https://twitter.com/BlueYonderTech>
tech.blue-yonder.com <http://tech.blue-yonder.com/>


On 21.07.18, 11:26, "Florian Schulze" <mail@florian-schulze.net> wrote:

    Hi!

    tldr: Quick solution would be changing root/pypi from a "mirror" to a 
    "cache" which doesn't store anything in the DB and isn't replicated. The 
    harder solution would be changes to the replication protocol and DB 
    storage.

    PyPI is growing quickly and that growth continues to affect the current 
    way devpi handles root/pypi. We already had to make changes in the past 
    due to the number of projects on PyPI.

    First an overview of my understanding of why root/pypi works the way it 
    does in devpi:

    In the past PyPI was unreliable. It was down or slow quite often and 
    that caused issues in day to day use. Several tools started adding 
    mitigations for that. For example zc.buildout was one of the first tools 
    that started caching data from PyPI locally to reduce download time and 
    make repeated installations quicker.

    Because devpi already provided the necessary data for pip and other 
    tools, it made sense to cache PyPI data locally.

    At some point replication was added to devpi. In regard to root/pypi, 
    the thinking was that all data the master already had from PyPI should
    be copied to the replicas, so all instances could provide the same view 
    and once an installation with pip has worked, it should continue to 
    work, regardless of which replica was used.

    Now to the issues.

    For replicas to be able to replicate all data easily and reliably 
    without having to download all data when out of sync, we use serialized 
    changesets. We used to store the list of project names from PyPI in the
    DB. The problem was that the list was stored in full each time it
    changed. Because new projects are added all the time, that list took up
    quite a bit of space in the DB. We then changed that by keeping the list
    only in RAM.

    Now the next issue is the growing number of releases per project. We 
    store the information from the links on the simple page of each project that
    was installed in the past. Whenever it is accessed again, it is updated 
    when there are new releases. For a busy devpi instance that data can 
    grow quite large.

    We also store all accessed release files.

    With devpi-web we added indexing for search. Here the number of projects 
    starts to become an issue as well. A new devpi instance downloads the 
    names of all projects on PyPI and indexes them. As of this writing, 
    these are ~150000 names. Writing that index on the first commit takes 
    several minutes with the Whoosh backend we currently have.

    Storage also becomes an issue. Long running devpi instances have a 
    constantly growing database and pile of package files. Currently
    there is no official way to clean that up, and the workarounds can cause
    unforeseen issues.

    Now the question is where to go from here.

    One idea is adding a "cache" index type, which doesn't replicate and 
    either doesn't index at all, or only indexes currently cached data. Such 
    an index would change behaviour of replicas, because each replica would 
    have a different state of cached data.

    Another solution would involve changing the replication protocol or at 
    least some assumptions about it.

    Currently replicas walk through all state changes of the master step by 
    step and redo everything that happened to get to the current state. This
    is pretty wasteful most of the time. A new replica should get to the
    current state as quickly as possible. The full metadata of the current
    state isn't big. The biggest part is the release files. A new replica
    could get the full data for the current serial and then follow the 
    individual changes from there. Fetching of the release files would work 
    the same as it does now, except the initial list would be much bigger.

    Once that works, we could start removing data from old serials. This can 
    be a triggered operation like vacuum on databases and at a later point 
    it might be possible to automate it. The master already keeps track of 
    connected replicas. So it's pretty easy to check what the oldest synced 
    serial of the replicas is and remove older ones. New replicas would get 
    the full data of the current serial in one go. If there are replicas 
    which have been out of sync for a longer time, they would also get a 
    full set of metadata but can keep the release files they already have 
    and update them.

    We could and most likely should limit these cleanups to the mirror 
    indexes and maybe deleted indexes.

    Both solutions would solve the storage issue in different ways. The
    biggest problem would still be the indexing, which I will cover in
    another mail.

    Regards,
    Florian Schulze