[Catalog-sig] pre-PEP: transition to release-file hosting at pypi site

Donald Stufft donald at stufft.io
Sun Mar 10 21:59:14 CET 2013

On Mar 10, 2013, at 3:41 PM, PJ Eby <pje at telecommunity.com> wrote:

> On Sun, Mar 10, 2013 at 11:07 AM, holger krekel <holger at merlinux.eu> wrote:
>> Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
>> scrutiny and feedback welcome.
> Hi Holger.  I'm having some difficulty interpreting your proposal
> because it is leaving out some things, and in other places
> contradicting what I know of how the tools work.  It is also a bit at
> odds with itself in some places.
> For instance, at the beginning, the PEP states its proposed solution
> is to host all release files on PyPI, but then the problem section
> describes the problems that arise from crawling external pages:
> problems that can be solved without actually hosting the files on
> PyPI.
> To me, it needs a clearer explanation of why the actual hosting part
> also needs to be on PyPI, not just the links.  In the threads to date,
> people have argued about uptime, security, etc., and these points are
> not covered by the PEP or even really touched on for the most part.
> (Actually, thinking about that makes me wonder....  Donald: did your
> analysis collect any stats on *where* those externally hosted files
> were hosted?  My intuition says that the bulk of the files (by *file
> count*) will come from a handful of highly-available domains, i.e.
> sourceforge, github, that sort of thing, with actual self-hosting
> being relatively rare *for the files themselves*, vs. a much wider
> range of domains for the homepage/download URLs (especially because
> those change from one release to the next.)  If that's true, then most
> complaints about availability are being caused by crawling multiple
> not-highly-available HTML pages, *not* by the downloading of the
> actual files.  If my intuition about the distribution is wrong, OTOH,
> it would provide a stronger argument for moving the files themselves
> to PyPI as well.)

No, but it wouldn't be difficult to take the list of packages I generated and run another script to see where the files that aren't available on PyPI are actually located. I'd like to emphasize again, though, that it doesn't really matter how good their uptime is: the best case scenario is that it doesn't hurt uptime, and the worst case (and typical case) is that it decreases it. A high-uptime host will just decrease it _less_ than a low-uptime host.
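For what it's worth, a script along those lines could be as simple as the following sketch. The URL list here is hypothetical; the real input would be the external file links gathered from the crawl:

```python
from collections import Counter
from urllib.parse import urlparse

def tally_hosts(urls):
    """Count how many externally hosted release files live on each domain."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Treat anything not served from PyPI itself as external hosting.
        if host and not host.endswith("pypi.python.org"):
            counts[host] += 1
    return counts

# Hypothetical sample of download links gathered by the crawl.
sample = [
    "http://sourceforge.net/projects/foo/files/foo-1.0.tar.gz",
    "https://github.com/bar/bar/archive/1.2.tar.gz",
    "http://pypi.python.org/packages/source/b/baz/baz-0.1.tar.gz",
    "http://example.org/downloads/qux-2.0.zip",
    "http://sourceforge.net/projects/foo/files/foo-1.1.tar.gz",
]
print(tally_hosts(sample).most_common())
```

That would answer the "where are they hosted" question by file count; weighting by download count would need the PyPI logs as well.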

> Digression aside, this is one of things that needs to be clearer so
> that there's a better explanation for package authors as to why
> they're being asked to change.  And although the base argument is good
> ("specifying the "homepage" will slow down the installation process"),
> it could be amplified further with an example of some project that has
> had multiple homepages over its lifetime, listing all the URLs that
> currently must be crawled before an installer can be sure it has found
> all available versions, platforms, and formats of that project.
> Okay, on to the Solution section.  Again, your stated problem is to
> fix crawling, but the solution is all about file hosting.  Regardless
> of which of these three "hosting modes" is selected, it remains an
> option for the developer to host files elsewhere, and provide the
> links in their description...  unless of course you intended to rule
> that out and forgot to mention it.  (Or, I suppose, if you did *not*
> intend to rule it out and intentionally omitted mention of that so the
> rabid anti-externalists would think you were on their side and not
> create further controversy...  in which case I've now spoiled things.
> Darn.  ;-) )
> Some technical details are also either incorrect or confusing.  For
> example, you state that "The original homepage/download links are
> added as links without a ``rel`` attribute if they have the ``#egg``
> format".  But if they are added without a rel attribute, it doesn't
> *matter* whether they have an #egg marker or not.  It is quite
> possible for a PyPI package to have a download_url of say,
> "http://sourceforge.net/download/someproject-1.2.tgz".
> Thus, I would suggest simply stating that changing hosting mode does
> not actually remove any links from the /simple index, it just removes
> the rel="" attributes from the "Home page" and "Download" links, thus
> preventing them from being crawled in search of additional file links.

In my opinion the final, PyPI-only mode needs to remove all external links from the /simple/ index.
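To make the mechanics concrete, here's a sketch of how an installer separates direct file links on a /simple/ page from the rel-tagged links it would otherwise crawl. The page snippet and project name are hypothetical:

```python
from html.parser import HTMLParser

class SimpleIndexParser(HTMLParser):
    """Collect links from a /simple/ project page, noting each link's rel attribute."""
    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, rel-or-None) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            if "href" in d:
                self.links.append((d["href"], d.get("rel")))

# A hypothetical /simple/ page: one direct file link plus two crawlable rel links.
page = """
<a href="../../packages/source/e/example/example-1.0.tar.gz">example-1.0.tar.gz</a>
<a href="http://example.org/" rel="homepage">home page</a>
<a href="http://example.org/dl/" rel="download">download link</a>
"""
parser = SimpleIndexParser()
parser.feed(page)
crawlable = [href for href, rel in parser.links if rel in ("homepage", "download")]
print(crawlable)
```

Stripping the rel attributes (PJ's suggestion) leaves `crawlable` empty but keeps the anchors on the page; removing the external links entirely (my preference for the final mode) means those anchors are gone from the index altogether.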

> With that out of the way, that brings me to the larger scope issue
> with the modes as presented.  Notice now that with this clarification,
> there is no real difference in *state* between the "pypi-cache" and
> "pypi-only" modes.  There is only a *functional* difference...  and
> that function is underspecified in the PEP.
> What I mean is, in both pypi-cache and pypi-only, the *state* of
> things is that rel="" attributes are gone, and there are links to
> files on PyPI.  The only difference is in *how* the files get there.
> And for the pypi-cache mode, this function is *really*
> under-specified.  Arguably, this is the meat of the proposal, but it
> is entirely missing.  There is nothing here about the frequency of
> crawling, the methods used to select or validate files, whether there
> is any expiration...  it is all just magically assumed to happen
> somehow.
> My suggestion would be to do two things:
> First, make the state a boolean: crawl external links, with the
> current state yes and the future state no, with "no" simply meaning
> that the rel="" attribute is removed from the links that currently
> have it.
> Second, propose to offer tools in the PyPI interface (and command
> line) to assist authors in making the transition, rather than
> proposing a completely unspecified caching mechanism.  Better to have
> some vaguely specified tools than a completely unspecified caching
> mechanism, and better still to spell out very precisely what those
> tools do.
> Okay, on to the "Phases of transition".  This section gets a lot
> simpler if there are only two stages.  Specifically, we let everyone
> know the change is going to happen, and how long they have, give 'em
> links to migration tools.  Done.  ;-)

This is my opinion as well, though I think we differ in what the final stage should look like.

> (Okay, so analysis still makes sense: the people who don't have any
> externally hosted files can get a different message, i.e., "Hey, we
> notice that installing your package is slow because you have these
> links that don't go anywhere.  Click here if you'd like PyPI to stop
> sending people on wild goose chases".  The people who have external
> hosted files will need a more involved message.)
> Whew.  Okay, that ends my critique of the PEP as it sits.  Now for an
> outside-the-box suggestion.
> If you'd like to be able to transition people away from spidered links
> in the fewest possible steps, with the least user action, no legal
> issues, and in a completely automated way, note that this can be done
> with a one-time spidering of the existing links to find the download
> links, then adding those links directly to the /simple index, and
> switching off the rel="" attributes.  This can be done without
> explicit user consent, though they can be given the chance to do it
> manually, sooner.
> To implement this you'd need two project-level (*not* release-level)
> fields: one to indicate whether the project is using rel="" or not,
> and one to contain the list of external download links, which would be
> user-editable.
> This overall approach I'm proposing can be extended to also support
> mirroring, since it provides an explicit place to list what it is
> you're mirroring.  (At any rate, it's more explicitly specified than
> any such place in the current PEP.)
> That field can also be fairly easily populated for any given project
> in just a few lines of code:
>    from pkg_resources import Requirement
>    pr = Requirement.parse('Projectname')
>    from setuptools.package_index import PackageIndex
>    pi = PackageIndex(search_path=[], python=None, platform=None)
>    pi.find_packages(pr)
>    all_urls = [dist.location for dist in pi[pr.key]]
>    external_urls = [url for url in all_urls if '//pypi.python.org' not in url]
> (Although if you want more information, like what kind of link each
> one is, the dist objects actually know a bit more than just the URL.)
> Anyway, I hope you found at least some of all this helpful.  ;-)
> _______________________________________________
> Catalog-SIG mailing list
> Catalog-SIG at python.org
> http://mail.python.org/mailman/listinfo/catalog-sig

I'm still against any off-PyPI hosting of files. I call it "external links" a lot, but what I really mean is the requirement to contact any host other than PyPI in order to install a package.

Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

