Thoughts on indexing
Hi!

The way indexing is currently implemented in devpi-web has some performance issues. One problem is the sheer number of projects registered on PyPI, which creates a huge initial commit for new devpi instances; this can take hours to complete and currently requires reads from PyPI which can fail at any point and abort the indexing. The other problem is that any change in a project, even just a new version number, causes a full reindex of the project, including its full documentation if it exists.

For the first issue I have a proposal; for the second I still have no idea, but the impact might be reduced. Use the following ticket to give thumbs up/down or whatever reaction, but keep discussions in this mail for now: https://github.com/devpi/devpi/issues/566

Regards,
Florian Schulze
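The batching idea can be sketched roughly as follows. This is a minimal illustration only, assuming each batch is committed independently so a failed PyPI read aborts just that batch; `fetch_project` and `index_batch` are hypothetical placeholders, not actual devpi-web APIs.

```python
def index_in_batches(project_names, fetch_project, index_batch, batch_size=100):
    """Build the search index in independently committed batches.

    Instead of one huge initial commit that can fail hours in, each batch
    is fetched and committed on its own. Returns (indexed, failed); failed
    batches can be retried later without redoing the work that succeeded.
    """
    indexed, failed = [], []
    for start in range(0, len(project_names), batch_size):
        batch = project_names[start:start + batch_size]
        try:
            docs = [fetch_project(name) for name in batch]  # may hit PyPI
            index_batch(docs)                               # one small commit
            indexed.extend(batch)
        except OSError:
            # A network failure aborts only this batch; record it for retry.
            failed.extend(batch)
    return indexed, failed
```

The point of the sketch is that progress survives a mid-run failure: everything outside the failed batch stays committed.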
Hi Florian,

We are happy to see that you plan to address the performance issues related to the indexing of devpi-web.

Your proposal regarding splitting up the search indexing sounds good. Am I correct in assuming that this means we will get a replica that, from the API point of view, catches up as quickly as one without devpi-web and afterwards starts returning reasonable search results step by step? This would help us with disaster recovery and, even more, with the export/import cycle for major version upgrades.

The second problem you mention in your mail, the reindexing of whole projects, is however the more interesting one for us in day-to-day operations. We have some projects with a significant number of versions and documentation of significant size. For these projects, uploading a version means that the replica process will be stuck indexing for around 10 minutes. We are currently mitigating this issue by running an additional `--requests-only` replica process. However, if the load induced by uploading a new version could be reduced conceptually, that would of course be the better solution.

There are two limitations that a solution for this second problem could have that would be non-issues for us:

1. It could be constrained to non-volatile indices. As versions cannot change on these, that might allow for some optimization.
2. Indexing could run in a separate process. This would avoid indexing hogging the global interpreter lock of the process running the replication. Additional code complexity might be an issue in this case, though.

In any case, we are looking forward to any improvements in this area.

Kind Regards,
Matthias
--
Dr. Matthias Bach, Senior Software Engineer, Blue Yonder GmbH
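The hand-off that Matthias' second point implies can be sketched as a work queue between the replication loop and an indexing worker. For brevity the worker below is a thread; in the suggestion from the mail it would be a separate process (e.g. `multiprocessing.Process`) so that indexing does not compete for the GIL of the replication process. All names here are illustrative, not actual devpi internals.

```python
import queue
import threading

def start_index_worker(index_fn):
    """Start a background worker that indexes queued project names."""
    work = queue.Queue()

    def worker():
        while True:
            name = work.get()
            if name is None:      # sentinel: shut down the worker
                break
            index_fn(name)        # potentially slow (minutes per project)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return work, t

def replicate_changes(changes, work):
    """The replication loop only enqueues; it never blocks on indexing."""
    for change in changes:
        work.put(change)          # returns immediately
```

The key property is that `replicate_changes` stays responsive regardless of how long indexing takes; moving the worker into its own process would additionally free the replication process from the indexing CPU load.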
On 24 Jul 2018, at 16:57, Matthias Bach wrote:
> Your proposal regarding splitting up the search indexing sounds good. Am I correct in assuming that this will mean that we will get a replica that, from the API point of view, catches up as quickly as one without devpi-web and, afterwards, will start returning reasonable search results step by step? This would help us with disaster recovery and, even more, with the export-import cycle for the major version upgrade.
That is the goal, yes. The replicas should have the metadata state as quickly as possible and catch up with everything else, like documentation unzipping, description rendering and indexing, afterwards.

Another thing would be to split off the file downloads as well. There is already functionality to fetch files from the master if they are missing, so we could start the downloads after the metadata is replicated completely, unlike now, where we wait for all the files in the current serial to finish downloading.

Regards,
Florian Schulze
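The fetch-missing-files fallback Florian refers to can be illustrated with a simple cache-or-fetch lookup: the replica serves files it already has locally and pulls the rest from the master on first request. This is a hedged sketch under assumed interfaces; `local_store` and `fetch_from_master` are placeholders, not devpi's actual storage or replication API.

```python
def get_file(path, local_store, fetch_from_master):
    """Return file content, falling back to the master on a local miss.

    After metadata replication completes, the replica can serve requests
    immediately; files not yet downloaded are fetched lazily from the
    master and cached for subsequent requests.
    """
    content = local_store.get(path)
    if content is None:
        content = fetch_from_master(path)  # lazy download after metadata sync
        local_store[path] = content        # cache so the master is hit once
    return content
```

With this fallback in place, background bulk downloading becomes an optimization rather than a prerequisite for serving requests.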
participants (2)
-
Florian Schulze
-
Matthias Bach