Does anyone have a pointer on what causes this error?
AttributeError: 'FlakesItem' object has no attribute '_collectfile'
Happens for me with devpi-server on Travis and I can't reproduce it.
I already reinstalled all my development dependencies with the latest
versions to see if a new package release causes it.
The DevPI replication architecture has served the community well. It's enabled some pretty powerful and robust deployments.
But as Florian has detailed in his recent posts (namely https://email@example.com/thread/VLFAG…), DevPI is also under pressure to scale with the vast and growing index of packages.
Given this challenge and the challenges brought by the 12-factor milestone (https://github.com/devpi/devpi/issues?q=is%3Aopen+is%3Aissue+milestone%3Atw…), I can't help but think about another architecture that might dramatically simplify the management and scaling issues and address 12-factor with one fundamental change: use MongoDB as the (sole) persistence backend.
MongoDB has some nice properties that might make it a perfect backend for DevPI:
- Replication is intrinsic to the design. Instead of implementing its own replication logic, devpi-server could be a simple service connected to whatever MongoDB instance or instances are appropriate and leverage the robust replication in MongoDB to address fault tolerance and local availability.
- Robust support for files through GridFS - MongoDB could store the metadata and the resources themselves in one distributed data store.
- High performance - MongoDB performs extremely well.
- Full-text index - MongoDB has integrated full-text indexing, which would obviate whoosh with a much faster indexing engine.
- Efficient compressed storage.
- Strong separation of concerns; devpi deals with application logic, database deals with persistence and replication logic.
- Many of the data management operations could be supported without any devpi code (backups, archives, defrag).
Although I see some advantages, I also anticipate some disadvantages:
- Each deployment must run MongoDB. Although it's trivially easy to download and run a MongoDB server, there's no embedded version, so something would need to orchestrate the install, setup, and teardown of an instance in non-production scenarios (to support developer mode).
- Creating a developer or offline replica of a production system would require special handling and maybe trickery, but I would expect this use case could be refined to leverage the MongoDB replication methodology as well.
- Such a design would require any other storage backends to be comparably capable and wouldn't allow for replication across different storage backends.
There are probably other issues I haven't yet conceived of or considered.
I could imagine two possible approaches to this. One is extremely disruptive: the persistence logic is largely replaced with this new methodology, a migration routine is written to move systems from the older paradigm to the new one, and support for the old paradigm is dropped.
A less disruptive approach could be to create a new abstraction layer, one in which the persistence backend is still pluggable, but that layer becomes responsible for the entirety of persistence, including file storage, metadata storage, replication, and indexing/searching... and one possible implementation of this backend is the MongoDB backend and another is the SQL/file system/whoosh/python replication functionality based on the existing kit.
The biggest architectural deviation I'm proposing here is that the replication functionality be pushed down into the persistence layer rather than operating at the application layer.
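
To make the less disruptive option a bit more concrete, here is a rough sketch of what such a pluggable layer might look like. Everything in it (the `PersistenceBackend` name, its method names, and the toy `InMemoryBackend`) is hypothetical and not part of devpi today. The point is only that the application talks to one interface for metadata, files, and search, while replication stays an internal concern of whichever backend is plugged in.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class PersistenceBackend(ABC):
    """Hypothetical interface owning all persistence concerns.

    Replication is deliberately absent: under this proposal it is
    handled inside the backend, not by the application layer.
    """

    @abstractmethod
    def put_metadata(self, key: str, value: dict) -> None: ...

    @abstractmethod
    def get_metadata(self, key: str) -> Optional[dict]: ...

    @abstractmethod
    def put_file(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def get_file(self, path: str) -> Optional[bytes]: ...

    @abstractmethod
    def search(self, text: str) -> List[str]: ...


class InMemoryBackend(PersistenceBackend):
    """Toy backend for illustration; a MongoDB or SQL/file system/Whoosh
    backend would implement the same interface."""

    def __init__(self) -> None:
        self._meta: Dict[str, dict] = {}
        self._files: Dict[str, bytes] = {}

    def put_metadata(self, key: str, value: dict) -> None:
        self._meta[key] = dict(value)

    def get_metadata(self, key: str) -> Optional[dict]:
        return self._meta.get(key)

    def put_file(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def get_file(self, path: str) -> Optional[bytes]:
        return self._files.get(path)

    def search(self, text: str) -> List[str]:
        # Naive case-insensitive substring match over metadata keys.
        needle = text.lower()
        return sorted(key for key in self._meta if needle in key.lower())
```

The application code would only ever hold a `PersistenceBackend` reference, so swapping backends would not touch the application logic.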
Would the project consider this approach? Are there any use cases I've missed in my consideration that would be adversely affected by such an approach? What would it take for you to be enthusiastic about this change?
The way indexing is currently implemented in devpi-web has some
One problem is the number of registered projects on PyPI, which creates a
huge initial commit for new devpi instances which can take hours to
complete and currently causes reads from PyPI which can fail at any
point and abort the indexing.
The other problem is that any change in a project, even just a new
version number, causes a full reindex of the project, including its full
documentation if it exists.
For the first issue I have a proposal, for the second I still have no
idea, but the impact might be reduced.
Use the following ticket to give thumbs up/down or whatever reaction,
but keep discussions in this mail for now.
tldr: Quick solution would be changing root/pypi from a "mirror" to a
"cache" which doesn't store anything in the DB and isn't replicated. The
harder solution would be changes to the replication protocol and DB
PyPI is growing quickly and that growth continues to affect the current
way devpi handles root/pypi. We already had to make changes in the past
due to the number of projects on PyPI.
First an overview of my understanding of why root/pypi works the way it
does in devpi:
In the past PyPI was unreliable. It was down or slow quite often and
that caused issues in day to day use. Several tools started adding
mitigations for that. For example zc.buildout was one of the first tools
that started caching data from PyPI locally to reduce download time and
make repeated installations quicker.
Because devpi already provided the necessary data for pip and other
tools, it made sense to cache PyPI data locally.
At some point replication was added to devpi. In regard to root/pypi,
the thinking was that all data the master already had from PyPI should
be copied to the replicas, so all instances could provide the same view
and once an installation with pip has worked, it should continue to
work, regardless of which replica was used.
Now to the issues.
For replicas to be able to replicate all data easily and reliably
without having to download all data when out of sync, we use serialized
changesets. For the list of names from PyPI we stored that list in the
DB. The problem was that the list was stored in full each time it changed.
Because new projects are added all the time, that list took up quite a
bit of space in the DB. We then changed it by only keeping the list in
Now the next issue is the growing number of releases per project. We
store the information from the links on the simple page of each project that
was installed in the past. Whenever it is accessed again, it is updated
when there are new releases. For a busy devpi instance that data can
grow quite large.
We also store all accessed release files.
With devpi-web we added indexing for search. Here the number of projects
starts to become an issue as well. A new devpi instance downloads the
names of all projects on PyPI and indexes them. As of this writing,
there are ~150000 names. Writing that index on the first commit takes
several minutes with the Whoosh backend we currently have.
Storage also becomes an issue. Long-running devpi instances have a
constantly growing database and pile of package files. And currently
there is no official way to clean that up and the workarounds can cause
Now the question is where to go from here.
One idea is adding a "cache" index type, which doesn't replicate and
either doesn't index at all, or only indexes currently cached data. Such
an index would change behaviour of replicas, because each replica would
have a different state of cached data.
Another solution would involve changing the replication protocol or at
least some assumptions about it.
Currently replicas walk through all state changes of the master step by
step and redo everything that happened to get to the current state. This
is pretty wasteful most of the time. A new replica should get to the
current state as quickly as possible. The full metadata of the current
state isn't big. The biggest part is the release files. A new replica
could get the full data for the current serial and then follow the
individual changes from there. Fetching of the release files would work
the same as it does now, except the initial list would be much bigger.
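
As a rough illustration of the difference, here is a toy model of both ways for a replica to catch up. It is a sketch under simplifying assumptions and not devpi's actual replication code: the `Master` and `Replica` classes, the changeset format (a dict mapping a key to its new value, with `None` meaning deleted), and all method names are made up for this example. Release files are left out entirely.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical changeset format: key -> new value, None means "deleted".
Changeset = Dict[str, Optional[str]]


def apply_changes(state: Dict[str, str], changes: Changeset) -> None:
    """Apply one changeset to a metadata state in place."""
    for key, value in changes.items():
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value


class Master:
    def __init__(self) -> None:
        self.changelog: List[Changeset] = []  # list index is the serial
        self.state: Dict[str, str] = {}

    def commit(self, changes: Changeset) -> int:
        self.changelog.append(dict(changes))
        apply_changes(self.state, changes)
        return len(self.changelog) - 1

    def snapshot(self) -> Tuple[int, Dict[str, str]]:
        """Return the current serial and a copy of the full current state."""
        return len(self.changelog) - 1, dict(self.state)


class Replica:
    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.serial = -1   # nothing synced yet
        self.applied = 0   # number of changesets replayed so far

    def bootstrap(self, master: Master) -> None:
        """Proposed behaviour: jump straight to the current serial."""
        self.serial, self.state = master.snapshot()

    def follow(self, master: Master) -> None:
        """Replay every changeset after our serial, one by one."""
        for serial in range(self.serial + 1, len(master.changelog)):
            apply_changes(self.state, master.changelog[serial])
            self.serial = serial
            self.applied += 1
```

A fresh replica that only calls `follow()` replays the whole history, which is what happens now. One that calls `bootstrap()` first ends up in the same state without replaying anything, and then uses `follow()` only for later serials.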
Once that works, we could start removing data from old serials. This can
be a triggered operation like vacuum on databases and at a later point
it might be possible to automate it. The master already keeps track of
connected replicas. So it's pretty easy to check what the oldest synced
serial of the replicas is and remove older ones. New replicas would get
the full data of the current serial in one go. If there are replicas
which have been out of sync for a longer time, they would also get a
full set of metadata but can keep the release files they already have
and update them.
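
The cleanup step could look roughly like the following. Again this is only a sketch with made-up names (`prune_old_serials`, a plain dict as the changelog, a dict of replica names to their last synced serial), not existing devpi code. It also assumes that with no replicas connected only the newest serial needs to be kept, which is my assumption and would need discussion.

```python
from typing import Dict


def prune_old_serials(changelog: Dict[int, dict],
                      replica_serials: Dict[str, int]) -> int:
    """Drop changelog entries older than the oldest synced replica serial.

    Returns the lowest serial that is kept. The entry at that serial is
    kept as well, which is slightly conservative.
    """
    if replica_serials:
        keep_from = min(replica_serials.values())
    else:
        # Assumption: without replicas, only the newest serial is needed.
        keep_from = max(changelog, default=0)
    for serial in [s for s in changelog if s < keep_from]:
        del changelog[serial]
    return keep_from


# Usage: ten serials, the slowest replica has synced up to serial 4.
changelog = {serial: {"key": serial} for serial in range(10)}
kept_from = prune_old_serials(changelog, {"replica-a": 7, "replica-b": 4})
```

A replica that reconnects with a serial below the returned value could no longer replay the missing changes and would have to fetch the full current metadata instead, as described above.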
We could and most likely should limit these cleanups to the mirror
indexes and maybe deleted indexes.
Both solutions would solve the storage issue in different ways. The
biggest problem would still be the indexing. Which will be in another