[Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
holger krekel
holger at merlinux.eu
Mon Mar 11 11:02:25 CET 2013
Hi Philip,
thanks for your helpful review; almost all of it makes sense to me.
Some more inline comments below. Up front, I am open to you
co-authoring the PEP if you like and if you share the goal of finding a
minimum viable approach to speed up and simplify the interactions for installers.
On Sun, Mar 10, 2013 at 15:41 -0400, PJ Eby wrote:
> On Sun, Mar 10, 2013 at 11:07 AM, holger krekel <holger at merlinux.eu> wrote:
> > Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
> > scrutiny and feedback welcome.
>
> Hi Holger. I'm having some difficulty interpreting your proposal
> because it is leaving out some things, and in other places
> contradicting what I know of how the tools work. It is also a bit at
> odds with itself in some places.
Certainly, it was a quick draft to get the process going and to gather
useful feedback, which has worked so far :)
> For instance, at the beginning, the PEP states its proposed solution
> is to host all release files on PyPI, but then the problem section
> describes the problems that arise from crawling external pages:
> problems that can be solved without actually hosting the files on
> PyPI.
>
> To me, it needs a clearer explanation of why the actual hosting part
> also needs to be on PyPI, not just the links. In the threads to date,
> people have argued about uptime, security, etc., and these points are
> not covered by the PEP or even really touched on for the most part.
Makes sense to clarify this more.
> (Actually, thinking about that makes me wonder.... Donald: did your
> analysis collect any stats on *where* those externally hosted files
> were hosted? My intuition says that the bulk of the files (by *file
> count*) will come from a handful of highly-available domains, i.e.
> sourceforge, github, that sort of thing, with actual self-hosting
> being relatively rare *for the files themselves*, vs. a much wider
> range of domains for the homepage/download URLs (especially because
> those change from one release to the next.) If that's true, then most
> complaints about availability are being caused by crawling multiple
> not-highly-available HTML pages, *not* by the downloading of the
> actual files. If my intuition about the distribution is wrong, OTOH,
> it would provide a stronger argument for moving the files themselves
> to PyPI as well.)
>
> Digression aside, this is one of the things that needs to be clearer so
> that there's a better explanation for package authors as to why
> they're being asked to change. And although the base argument is good
> ("specifying the "homepage" will slow down the installation process"),
> it could be amplified further with an example of some project that has
> had multiple homepages over its lifetime, listing all the URLs that
> currently must be crawled before an installer can be sure it has found
> all available versions, platforms, and formats of that project.
Right, an example makes sense.
> Okay, on to the Solution section. Again, your stated problem is to
> fix crawling, but the solution is all about file hosting. Regardless
> of which of these three "hosting modes" is selected, it remains an
> option for the developer to host files elsewhere, and provide the
> links in their description... unless of course you intended to rule
> that out and forgot to mention it. (Or, I suppose, if you did *not*
> intend to rule it out and intentionally omitted mention of that so the
> rabid anti-externalists would think you were on their side and not
> create further controversy... in which case I've now spoiled things.
> Darn. ;-) )
To be honest, while drafting i forgot about the fact that the
long_description can contain package links as well.
> Some technical details are also either incorrect or confusing. For
> example, you state that "The original homepage/download links are
> added as links without a ``rel`` attribute if they have the ``#egg``
> format". But if they are added without a rel attribute, it doesn't
> *matter* whether they have an #egg marker or not. It is quite
> possible for a PyPI package to have a download_url of say,
> "http://sourceforge.net/download/someproject-1.2.tgz".
Right. I just wanted to clarify that the distutils metadata
"download_url" can contain an #egg format link and that this link
should still be served (without a rel).
> Thus, I would suggest simply stating that changing hosting mode does
> not actually remove any links from the /simple index, it just removes
> the rel="" attributes from the "Home page" and "Download" links, thus
> preventing them from being crawled in search of additional file links.
That's certainly a better description of what effectively happens,
and it avoids the special mention of #egg.
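As a rough illustration of that description (a sketch only, not how
pypi.python.org actually renders its pages): the links stay on the /simple
page and only the rel attributes go away. The function name, the regular
expression and the sample page are made up for this example; it assumes the
crawled links are the ones marked rel="homepage" and rel="download".

```python
import re

# Matches a rel="homepage" or rel="download" attribute, including the
# whitespace in front of it, so the rest of the anchor tag is left intact.
REL_ATTR = re.compile(r'\s+rel=(["\'])(?:homepage|download)\1', re.IGNORECASE)


def strip_crawl_rels(simple_page_html):
    """Return the page with the crawl-triggering rel attributes removed.

    No link is dropped; installers just stop treating the "Home page"
    and "Download" links as pages to crawl.
    """
    return REL_ATTR.sub("", simple_page_html)


# Hypothetical /simple page content for a project called "proj".
sample = (
    '<a href="http://example.org/proj" rel="homepage">Home page</a>\n'
    '<a href="http://example.org/dl/proj-1.2.tgz" rel="download">Download</a>\n'
    '<a href="../../packages/source/p/proj/proj-1.2.tar.gz">proj-1.2.tar.gz</a>\n'
)
stripped = strip_crawl_rels(sample)
```

All three links are still present in `stripped`, including the direct
download_url link, but none of them carries a rel attribute any more.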
> With that out of the way, that brings me to the larger scope issue
> with the modes as presented. Notice now that with this clarification,
> there is no real difference in *state* between the "pypi-cache" and
> "pypi-only" modes. There is only a *functional* difference... and
> that function is underspecified in the PEP.
Agreed.
> What I mean is, in both pypi-cache and pypi-only, the *state* of
> things is that rel="" attributes are gone, and there are links to
> files on PyPI. The only difference is in *how* the files get there.
Yes.
> And for the pypi-cache mode, this function is *really*
> under-specified. Arguably, this is the meat of the proposal, but it
> is entirely missing. There is nothing here about the frequency of
> crawling, the methods used to select or validate files, whether there
> is any expiration... it is all just magically assumed to happen
> somehow.
I'd like to avoid cache-invalidation issues by only performing cache
updates upon three user actions:
- when a release is registered for a package which is in
  "pypi-cache" hosting mode
- when a maintainer chooses to set "pypi-cache"
- when a maintainer explicitly triggers a "cache" update
All of these actions allow pypi.python.org to provide feedback / error codes,
so nothing hidden happens at regular intervals.
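A minimal sketch of how these three triggers could be wired up, just to make
the idea concrete. Nothing here exists in the PyPI code base: the action
names, the project dictionary and the fetch_external_files callback are all
hypothetical.

```python
def update_cache(project, action, fetch_external_files):
    """Run a cache update for one of the three user actions.

    Returns a small feedback dict so the caller (web UI or command line)
    can report success or an error code back to the user right away.
    """
    if action == "set-mode-pypi-cache":
        # The maintainer switches the project into "pypi-cache" mode.
        project["hosting_mode"] = "pypi-cache"
    elif action not in ("register-release", "trigger-cache-update"):
        return {"updated": False, "error": "unknown action: %s" % action}

    # Registering a release only triggers caching in "pypi-cache" mode.
    if project["hosting_mode"] != "pypi-cache":
        return {"updated": False, "error": None}

    try:
        files = fetch_external_files(project)
    except IOError as exc:
        # Surface the failure immediately instead of retrying in the background.
        return {"updated": False, "error": str(exc)}

    known = set(project.get("cached_files", []))
    project["cached_files"] = sorted(known | set(files))
    return {"updated": True, "error": None}
```

Because every update is tied to a user action, a failed fetch shows up as an
error in the response to that action rather than in some periodic job.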
> My suggestion would be to do two things:
>
> First, make the state a boolean: crawl external links, with the
> current state yes and the future state no, with "no" simply meaning
> that the rel="" attribute is removed from the links that currently
> have it.
>
> Second, propose to offer tools in the PyPI interface (and command
> line) to assist authors in making the transition, rather than
> proposing a completely unspecified caching mechanism. Better to have
> some vaguely specified tools than a completely unspecified caching
> mechanism, and better still to spell out very precisely what those
> tools do.
This structure makes sense to me, except that I see the need to start off with
"pypi-ext", i.e. a third state which encodes the current behaviour.
The thing is that pypi.python.org doesn't have an extensive test
suite, so we will need to rely on a few early adopters
using the tools/state changes before starting phase 2 (mass mailings etc.).
Also, in case of problems we can always switch packages back to the safe
"pypi-ext" mode. IOW, the motivation for this third state is the
actual implementation process.
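To spell out the three states, here is a small sketch under my own
assumptions (the helper names and the project dictionary are hypothetical;
only the mode names come from the draft). The point is that "pypi-ext" is the
only mode in which rel attributes are still emitted, and that a project can
always be set back to it.

```python
# The three hosting modes from the draft; "pypi-ext" is today's behaviour.
HOSTING_MODES = ("pypi-ext", "pypi-cache", "pypi-only")


def set_hosting_mode(project, mode):
    """Switch a project to the given mode, rejecting unknown mode names."""
    if mode not in HOSTING_MODES:
        raise ValueError("unknown hosting mode: %r" % (mode,))
    project["hosting_mode"] = mode
    return project


def emits_rel_links(project):
    """True if the /simple page should still carry crawlable rel links."""
    # Projects that never chose a mode keep the current behaviour.
    return project.get("hosting_mode", "pypi-ext") == "pypi-ext"
```

An early adopter would call set_hosting_mode(project, "pypi-cache") and, if
anything breaks, set_hosting_mode(project, "pypi-ext") to get back to the
current behaviour.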
> Okay, on to the "Phases of transition". This section gets a lot
> simpler if there are only two stages. Specifically, we let everyone
> know the change is going to happen, and how long they have, give 'em
> links to migration tools. Done. ;-)
>
> (Okay, so analysis still makes sense: the people who don't have any
> externally hosted files can get a different message, i.e., "Hey, we
> notice that installing your package is slow because you have these
> links that don't go anywhere. Click here if you'd like PyPI to stop
> sending people on wild goose chases". The people who have external
> hosted files will need a more involved message.)
>
> Whew. Okay, that ends my critique of the PEP as it sits. Now for an
> outside-the-box suggestion.
>
> If you'd like to be able to transition people away from spidered links
> in the fewest possible steps, with the least user action, no legal
> issues, and in a completely automated way, note that this can be done
> with a one-time spidering of the existing links to find the download
> links, then adding those links directly to the /simple index, and
> switching off the rel="" attributes. This can be done without
> explicit user consent, though they can be given the chance to do it
> manually, sooner.
Right, my mail preceding the "pre-pep" one contained a "linkext" state
which spidered the links and offered them directly. It's certainly possible
and indeed would likely not have legal issues. It might have
cache-invalidation issues and probably makes the pypi-side implementation
more complex. Also it goes a bit against the current intention of the
PEP to have pypi.python.org control all hosting of release files.
> To implement this you'd need two project-level (*not* release-level)
> fields: one to indicate whether the project is using rel="" or not,
> and one to contain the list of external download links, which would be
> user-editable.
>
> This overall approach I'm proposing can be extended to also support
> mirroring, since it provides an explicit place to list what it is
> you're mirroring. (At any rate, it's more explicitly specified than
> any such place in the current PEP.)
>
> That field can also be fairly easily populated for any given project
> in just a few lines of code:
>
>     from pkg_resources import Requirement
>     from setuptools.package_index import PackageIndex
>
>     pr = Requirement.parse('Projectname')
>     # Empty search_path: ignore locally installed distributions.
>     pi = PackageIndex(search_path=[], python=None, platform=None)
>     pi.find_packages(pr)  # scan the index (and linked pages) for the project
>     all_urls = [dist.location for dist in pi[pr.key]]
>     external_urls = [url for url in all_urls if '//pypi.python.org' not in url]
>
> (Although if you want more information, like what kind of link each
> one is, the dist objects actually know a bit more than just the URL.)
>
> Anyway, I hope you found at least some of all this helpful. ;-)
Certainly! I will try to do an update incorporating your suggestions
in the next few days.
best,
holger