DevPI Replication Architecture - A Radical Proposal
The DevPI replication architecture has served the community well. It has enabled some pretty powerful and robust deployments. But as Florian has detailed in his recent posts (namely https://mail.python.org/mm3/archives/list/devpi-dev@python.org/thread/VLFAGR...), DevPI is also under pressure to scale with the vast and growing index of packages. Given this challenge and the challenges brought by the 12-factor milestone (https://github.com/devpi/devpi/issues?q=is%3Aopen+is%3Aissue+milestone%3Atwe...), I can't help but think about another architecture that might dramatically simplify the management and scaling issues and address 12-factor with one fundamental change: use MongoDB as the (sole) persistence backend.

MongoDB has some nice properties that might make it a perfect backend for DevPI:

- Replication is intrinsic to the design. Instead of implementing its own replication logic, devpi-server could be a simple service connected to whatever MongoDB instance or instances are appropriate, leveraging MongoDB's robust replication for fault tolerance and local availability.
- Robust support for files through GridFS - MongoDB could store the metadata and the resources themselves in one distributed data store.
- High performance - MongoDB performs extremely well.
- Full-text indexing - MongoDB has integrated full-text indexing, which would obviate whoosh with a much faster indexing engine.
- Efficient compressed storage.
- Strong separation of concerns: devpi deals with application logic; the database deals with persistence and replication logic.
- Many data management operations (backups, archives, defrag) could be supported without any devpi code.

Although I see some advantages, I also anticipate some disadvantages:

- Each deployment must run MongoDB. Although it's trivially easy to download and run a MongoDB server, there's no embedded version, so something would need to orchestrate the install, setup, and teardown of an instance in non-production scenarios (to support developer mode).
- Creating a developer or offline replica of a production system would require special handling and maybe trickery, but I would expect this use case could be refined to also leverage the MongoDB replication methodology.
- Such a design would require any other storage backends to be comparably capable and wouldn't allow for replication across different storage backends.

There are probably other issues I haven't yet conceived of or considered.

I can imagine two possible approaches. One is extremely disruptive: the persistence logic is largely replaced with this new methodology, a migration routine is written to move systems from the old paradigm to the new, and support for the old paradigm is dropped. A less disruptive approach would be to create a new abstraction layer, one in which the persistence backend is still pluggable, but that layer becomes responsible for the entirety of persistence, including file storage, metadata storage, replication, and indexing/searching... one possible implementation of this backend would be the MongoDB backend and another the SQL/file system/whoosh/Python replication functionality based on the existing kit.

The biggest architectural deviation I'm proposing here is that the replication functionality be pushed down into the persistence layer rather than operating at the application layer.

Would the project consider this approach? Are there any use cases I've missed in my consideration that would be adversely affected by such an approach? What would it take for you to be enthusiastic about this change?
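To make the "new abstraction layer" idea concrete, here is a minimal sketch of what such a unified persistence interface might look like. All names here are invented for illustration; they are not devpi's actual internal API. The point is that one object owns metadata, files, and search (and, in a real backend, replication), and that a MongoDB/GridFS implementation and the existing SQL/filesystem/whoosh stack would both sit behind the same interface.

```python
from abc import ABC, abstractmethod


class PersistenceBackend(ABC):
    """Hypothetical unified persistence interface (illustrative only).

    A backend would own metadata storage, file storage, search
    indexing and -- in a real implementation -- replication."""

    @abstractmethod
    def set_metadata(self, key: str, value: dict) -> None: ...

    @abstractmethod
    def get_metadata(self, key: str) -> dict: ...

    @abstractmethod
    def store_file(self, path: str, content: bytes) -> None: ...

    @abstractmethod
    def get_file(self, path: str) -> bytes: ...

    @abstractmethod
    def search(self, text: str) -> list: ...


class InMemoryBackend(PersistenceBackend):
    """Toy stand-in for a real backend (MongoDB/GridFS, or the
    existing SQL + filesystem + whoosh stack)."""

    def __init__(self):
        self._meta = {}
        self._files = {}

    def set_metadata(self, key, value):
        self._meta[key] = value

    def get_metadata(self, key):
        return self._meta[key]

    def store_file(self, path, content):
        self._files[path] = content

    def get_file(self, path):
        return self._files[path]

    def search(self, text):
        # A real backend would delegate to its full-text index.
        return [k for k, v in self._meta.items() if text in str(v)]


backend = InMemoryBackend()
backend.set_metadata("root/pypi/demo", {"name": "demo", "version": "1.0"})
backend.store_file("demo-1.0.tar.gz", b"...")
print(backend.search("demo"))  # prints ['root/pypi/demo']
```

With such a seam in place, the disruptive and non-disruptive paths differ only in whether the legacy stack remains one implementation of this interface or is removed.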
Hi,

On Tue, Jul 24, 2018 at 02:57:23PM -0000, Jason R. Coombs wrote:
MongoDB has some nice properties that might make it a perfect backend for DevPI:
I don't actually know MongoDB very well, but the fact that it used to listen on any interface without any authentication by default[1] left a sour taste... If things are better nowadays, never mind.

[1] https://snyk.io/blog/mongodb-hack-and-secure-defaults/
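(For context: since MongoDB 3.6 the server binds to localhost by default, but access control is still opt-in. A minimal hardening fragment for mongod.conf, using the documented net.bindIp and security.authorization options, looks like this -- the values shown are illustrative:)

```yaml
# mongod.conf -- bind only to loopback and require authentication
net:
  bindIp: 127.0.0.1
security:
  authorization: enabled
```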
- Each deployment must run MongoDB. Although it's trivially easy to download and run a MongoDB server, there's no embedded version, so something would need to orchestrate the install, setup, and teardown of an instance in non-production scenarios (to support developer mode).
One thing I really like devpi for (and some people I know do as well) is that it's trivial to start on your laptop and just use it as a PyPI cache[2]. This would probably get a bit more difficult (or a lot, depending on OS/distribution, I'd guess).

[2] https://devpi.net/docs/devpi/devpi/stable/%2Bd/index.html

Florian (Bruhin, not Schulze)

--
https://www.qutebrowser.org | me@the-compiler.org (Mail/XMPP)
GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
I love long mails! | https://email.is-not-s.ms/
On 24 Jul 2018, at 16:57, Jason R. Coombs wrote:
A less disruptive approach could be to create a new abstraction layer, one in which the persistence backend is still pluggable, but that layer becomes responsible for the entirety of persistence, including file storage, metadata storage, replication, and indexing/searching... and one possible implementation of this backend is the MongoDB backend and another is the SQL/file system/whoosh/python replication functionality based on the existing kit.
I took another look at what we have, and I think this is totally feasible. The "KeyFS" + "FileStorage" interface we have internally has very little API, and it looks like it could be separated out. We would provide a new plugin one layer up from the current "Storage" plugins we have. I have no experience with MongoDB and only very little with NoSQL. That's another reason I like the less disruptive approach: the new backend could mature on its own without disrupting existing installations while it does.
The biggest architectural deviation I'm proposing here is that the replication functionality be pushed down into the persistence layer rather than operating at the application layer.
That's already the case, but the current Storage plugins are below that layer.
Would the project consider this approach? Are there any use cases I've missed in my consideration that would be adversely affected by such an approach? What would it take for you to be enthusiastic about this change?
I'm all for the less disruptive path, as it would also allow deeper changes in the currently existing backends, like using more relational patterns in the SQL backends and search with PostgreSQL.

Regards,
Florian Schulze