On Wed, 5 Jul 2023 at 17:00, Christopher Barker <pythonchb@gmail.com> wrote:
I'm noting this, because I think it's part of the problem to be solved, but maybe not the main one (to me anyway). I've been focused more on "these packages are worthwhile" (by some definition of worthwhile), while I think Chris A is more focused on "which of these seemingly similar packages should I use?" -- not unrelated, but not the same question either.
Indeed, not the same question; but "some definition of worthwhile" is the crucial point here. If there is one single curated package index of "worthwhile" packages, who decides what's on it and what's not? If not everyone can agree, will there have to be multiple such listings?
Technically, conda is similar to pip -- it has a default "channel" (a channel is an indexed repository of packages) it points to, and you can point it to a different one, or any number of others, or install a single package from a particular channel.
Socially, it's pretty different:
- There is no channel like PyPI that anyone can put anything on willy-nilly.
- The default channel is operated by Anaconda.com -- and no one else can put anything on there. (They take suggestions, but it's a pretty big lift to get them to add a package.)
- The protocol for a channel is pretty simple -- all you really need is an HTTP server, but in practice, most folks host their channels on the Anaconda.org server -- it's a free service that anyone can create a channel on -- there are a LOT -- folks use them for their personal projects, etc.
So, high barrier to entry. Good to know. That's neither good nor bad inherently, but it is a point of note.
- Then there is conda-forge: It grew out of an effort to collaborate among a number of folks operating channels focused on particular fields -- met/ocean science, astronomy, computational biology, ... we all had different needs, but they overlapped -- why not share resources? Thanks to the heroic efforts of a few folks, it grew to what it is now: a GitHub- and CI-based conda package build system that publishes a conda channel on anaconda.org with over 22,000 (wow! I think I'm reading that right) packages.
(https://anaconda.org/conda-forge/repo)
They are curated -- anyone can propose a new package (via PR) -- but it only gets added once it's been reviewed and approved by the core team. Curation wasn't the goal, but it's necessary in order to have any hope that they will all work together. The review process is really of the package, not the code in the package (is it built correctly? is it compatible with the rest of conda-forge? Does it include the license file? Is there a maintainer? ...) But the end result is a fair bit of curation -- users can be assured that:
1 - The package works
2 - The package is useful enough that someone took the time to get it up there.
3 - It's very unlikely to be malware (I don't think the conda-forge policy really looks hard for that, but typosquatting and that sort of thing are pretty much impossible).
Cool. The trouble is, point 1 is nearly impossible to assure except in the very narrowest of definitions, and point 2's value correlates with the height of the barrier to entry, so it's a fairly strict tradeoff. And unless that barrier is extremely high, there will always be the possibility that someone puts in the effort to get malware pushed, although it does become vanishingly improbable.
What about OS package managers like the Debian repositories?
I have no idea, other than that the majors, at least, put a LOT of work into having a pretty comprehensive base repository of "vetted" packages.
Right; hence the question of how a "vetted Python package collection" would compare. I can type "sudo apt install python-" and add the name of a package, and I get some assurance that:
1) The package works
2) The package is useful enough
3) It's not malware
4) The specific *version* of the package works along with the versions of everything else.
This is a very strong set of criteria, much stronger than we'd be looking for here, as they come with correspondingly higher barriers to entry (getting a package update into the Debian repositories becomes harder and harder as the release date approaches).
conda-forge has about 22,121 -- that's enough to be very useful, but a lot of use-cases are not well covered, and I know I still have to contribute one once in a while.
Looking now -- PyPI has 465,295 projects, more than 20 times as many -- I wonder how many of those are "useful"?
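As a quick sanity check on the "more than 20 times" figure, here is a minimal sketch that uses only the two counts quoted in this thread (22,121 conda-forge packages and 465,295 PyPI projects). The constant and function names are made up for illustration, and both counts are a snapshot that changes daily.

```python
# Counts as quoted in this thread (a July 2023 snapshot; both change daily).
CONDA_FORGE_PACKAGES = 22_121
PYPI_PROJECTS = 465_295


def ratio(larger: int, smaller: int) -> float:
    """Return how many times larger the first count is than the second."""
    return larger / smaller


# Roughly how many PyPI projects exist for each conda-forge package.
pypi_per_conda_forge = ratio(PYPI_PROJECTS, CONDA_FORGE_PACKAGES)
print(f"PyPI projects per conda-forge package: {pypi_per_conda_forge:.1f}")
```

With these numbers the ratio comes out at about 21, so "more than 20 times as many" holds, though only just.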
Contrariwise, the Debian repository has under a thousand "python-*" packages, but with a much stronger requirement that they be useful. It's interesting that there are only twenty on PyPI for every one on conda-forge. I would have expected a greater ratio. It seems that conda-forge is able to be incomplete AND dauntingly large; how successful would you be at guessing a package name based on a desired goal?

ChrisA