The DevPI replication architecture has served the community well and has enabled some powerful and robust deployments. But as Florian has detailed in his recent posts (namely https://mail.python.org/mm3/archives/list/devpi-dev@python.org/thread/VLFAGR...), DevPI is also under pressure to scale with the vast and growing index of packages. Given this challenge and the challenges brought by the 12-factor milestone (https://github.com/devpi/devpi/issues?q=is%3Aopen+is%3Aissue+milestone%3Atwe...), I can't help but think about another architecture that might dramatically simplify the management and scaling issues and address 12-factor with one fundamental change: use MongoDB as the sole persistence backend.

MongoDB has some nice properties that might make it a perfect backend for DevPI:

- Replication is intrinsic to the design. Instead of implementing its own replication logic, devpi-server could be a simple service connected to whatever MongoDB instance or instances are appropriate, leveraging MongoDB's robust replication for fault tolerance and local availability.
- Robust support for files through GridFS: MongoDB could store the metadata and the resources themselves in one distributed data store.
- High performance: MongoDB performs extremely well.
- Full-text indexing: MongoDB has integrated full-text indexing, which would obviate Whoosh with a much faster indexing engine.
- Efficient compressed storage.
- Strong separation of concerns: devpi deals with application logic; the database deals with persistence and replication logic.
- Many data management operations (backups, archives, defragmentation) could be supported without any devpi code.

Although I see these advantages, I also anticipate some disadvantages:

- Each deployment must run MongoDB. Although it's trivially easy to download and run a MongoDB server, there's no embedded version, so something would need to orchestrate the install, setup, and teardown of an instance in non-production scenarios (to support developer mode).
- Creating a developer or offline replica of a production system would require special handling and maybe some trickery, though I would expect this use case could be refined to leverage the MongoDB replication methodology as well.
- Such a design would require any other storage backend to be comparably capable and wouldn't allow for replication across different storage backends.

There are probably other issues I haven't yet conceived of or considered.

I can imagine two possible approaches. One is extremely disruptive: the persistence logic is largely replaced with this new methodology, a migration routine is written to move systems from the old paradigm to the new, and support for the old paradigm is dropped. A less disruptive approach would be to create a new abstraction layer in which the persistence backend is still pluggable, but that layer becomes responsible for the entirety of persistence, including file storage, metadata storage, replication, and indexing/searching. One implementation of this backend would be MongoDB; another would be the SQL/filesystem/Whoosh/Python replication functionality based on the existing kit.

The biggest architectural deviation I'm proposing here is that replication be pushed down into the persistence layer rather than operating at the application layer.

Would the project consider this approach? Are there any use cases I've missed in my consideration that would be adversely affected by such an approach? What would it take for you to be enthusiastic about this change?
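To make the less disruptive option concrete, here is a rough sketch of what an all-encompassing backend interface might look like. Everything here is hypothetical (the names `PersistenceBackend`, `InMemoryBackend`, and the method signatures are mine, not existing devpi APIs); the in-memory implementation just stands in for the existing SQL/filesystem/Whoosh stack, and a MongoDB implementation would satisfy the same interface via GridFS and text indexes.

```python
from abc import ABC, abstractmethod


class PersistenceBackend(ABC):
    """Hypothetical pluggable backend owning *all* persistence concerns:
    file storage, metadata storage, replication, and search."""

    @abstractmethod
    def store_file(self, path: str, content: bytes) -> None: ...

    @abstractmethod
    def get_file(self, path: str) -> bytes: ...

    @abstractmethod
    def store_metadata(self, key: str, doc: dict) -> None: ...

    @abstractmethod
    def get_metadata(self, key: str) -> dict: ...

    @abstractmethod
    def search(self, text: str) -> list: ...


class InMemoryBackend(PersistenceBackend):
    """Toy stand-in for the existing SQL/filesystem/Whoosh kit.
    A MongoDbBackend would implement the same interface on top of
    GridFS (files) and a text index (search)."""

    def __init__(self):
        self._files = {}
        self._meta = {}

    def store_file(self, path, content):
        self._files[path] = content

    def get_file(self, path):
        return self._files[path]

    def store_metadata(self, key, doc):
        self._meta[key] = doc

    def get_metadata(self, key):
        return self._meta[key]

    def search(self, text):
        # Naive full scan; MongoDB would use a $text index instead.
        return [key for key, doc in self._meta.items()
                if text in " ".join(str(v) for v in doc.values())]
```

The application layer would talk only to `PersistenceBackend`, so replication stops being devpi's problem and becomes the backend's.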
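To illustrate what pushing replication into the persistence layer could mean operationally, fault tolerance would be expressed in the MongoDB connection string rather than in devpi replication commands. The invocation below is purely hypothetical (there is no MongoDB storage backend or `--mongodb-uri` option today, and all host names are made up); it's a sketch of the deployment shape, not a real command.

```shell
# Hypothetical: point devpi-server at a three-node MongoDB replica set.
# Failover and replication are handled entirely by MongoDB; devpi-server
# itself would run as a stateless 12-factor service.
devpi-server --storage mongodb \
    --mongodb-uri "mongodb://db1.example.com,db2.example.com,db3.example.com/devpi?replicaSet=rs0"
```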