[Catalog-sig] homepage/download metadata cleaning

PJ Eby pje at telecommunity.com
Sat Mar 2 06:08:47 CET 2013

On Fri, Mar 1, 2013 at 6:02 PM, holger krekel <holger at merlinux.eu> wrote:
> On Fri, Mar 01, 2013 at 23:50 +0100, Lennart Regebro wrote:
>> On Fri, Mar 1, 2013 at 8:31 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>> > Hmm, then why not remove links that don't match the above from
>> > the /simple/ index pages ?
>> I think we can do that, but if we *start* with that, we will just
>> suddenly, with no warning, break everything.
>> It's better if the installation tools can first warn, then remove
>> their support for this, and *then* we remove these links from
>> /simple/.
> I think Marc-Andre was just referring to the superfluous links
> from the long-description, namely all links which don't match
> the #egg format and don't have a rel of download/homepage.
> Phillip clarified that pypi served all long-description links at the
> time to leave it to the tools to interpret them.  The interpretation is
> now pretty clear and so pypi doesn't need to provide them.  It shouldn't
> break either pip or easy_install to remove those unused long-description
> links.

Provided, of course, that PyPI follows the *exact same* interpretation
of what is and isn't an unused link.  Since unused links do no harm,
there is correspondingly no benefit to writing removal code, which
might itself introduce bugs.
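
To illustrate why the interpretation would have to match exactly, here
is a minimal sketch of the rule as worded in the quoted message: a link
counts as "used" if it has rel="download" or rel="homepage", or if its
fragment is in the #egg= form.  The names LinkCollector, is_used and
split_links are hypothetical, and this is not the code that pip,
easy_install, or PyPI actually runs; the real tools may apply further
rules, which is exactly where a mismatch could creep in.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, rel) pairs from every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href:
            # A missing rel attribute is treated as an empty string.
            self.links.append((href, (attrs.get("rel") or "").lower()))


def is_used(href, rel):
    """Assumed rule: rel of download/homepage, or an #egg= fragment."""
    if rel in ("download", "homepage"):
        return True
    fragment = href.partition("#")[2]
    return fragment.startswith("egg=")


def split_links(html):
    """Return (used, unused) lists of hrefs under the assumed rule."""
    collector = LinkCollector()
    collector.feed(html)
    used = [h for h, r in collector.links if is_used(h, r)]
    unused = [h for h, r in collector.links if not is_used(h, r)]
    return used, unused
```

Any link the installers follow that this rule files under "unused"
would silently disappear if PyPI filtered on it.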

To be clear, what I have proposed is simply removing the rel=""
attributes from the special links on hidden releases.  This will
prevent scraping of outdated home pages or download pages, but tools
will still be able to use a download or home page link that points to
an actual downloadable file or source checkout.
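
A minimal sketch of that proposal, assuming a hypothetical render_link
helper that builds one anchor for a /simple/ page (PyPI's real
rendering code is not shown in this thread): the link itself is kept,
and only the rel attribute is omitted when the release is hidden.

```python
from html import escape


def render_link(href, text, rel=None, hidden=False):
    """Render one anchor; drop rel for hidden releases (assumed policy)."""
    if hidden or not rel:
        rel_attr = ""
    else:
        rel_attr = ' rel="%s"' % escape(rel, quote=True)
    return '<a href="%s"%s>%s</a>' % (
        escape(href, quote=True),
        rel_attr,
        escape(text),
    )
```

With hidden=True the href is still present, so a tool can use it if it
points directly at a file or checkout, but nothing marks it as a page
to scrape.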

What would also be useful to have before that time is a tool to
let people either update their description links with direct external
links, or optionally upload the contents of those links instead...
preferably offered via a couple of buttons in PyPI's UI, as well as a
standalone tool or setup.py command to initiate the process remotely
or as part of a release process.  (Preferably, these tools would be
offered to authors *before* the date when the rel="" attributes would
be pulled from PyPI, of course.)

(In principle, we could make it even easier by just automatically
scraping the links and adding them to the descriptions (or some new
PyPI field for "external download links") of such releases, but I
think some kind of affirmative consent is probably in order, just to
avoid ruffling any feathers.)

Anyway, if the direct external links carry #md5 hashes, they'll be
slightly more secure and the "expired domain supplying fake links"
issue won't apply.
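
As a rough sketch of the check this enables, assuming the hash is
carried as an "#md5=<hexdigest>" fragment on the link (the function
names expected_md5 and matches are hypothetical, not taken from any
installer):

```python
import hashlib
from urllib.parse import urldefrag


def expected_md5(url):
    """Return the hex digest from an #md5= fragment, or None if absent."""
    _base, fragment = urldefrag(url)
    key, _, value = fragment.partition("=")
    if key == "md5" and value:
        return value.lower()
    return None


def matches(url, data):
    """Compare downloaded bytes with the link's hash.

    Returns None when the link carries no #md5 fragment to check.
    """
    want = expected_md5(url)
    if want is None:
        return None
    return hashlib.md5(data).hexdigest() == want
```

Because the hash is published on PyPI rather than on the external
site, whoever later controls that site cannot substitute a different
file without the check failing.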

The final step in the process would be to drop the rel="" attributes
from *all* releases, not just hidden ones.  At that point, it wouldn't
be possible to download from an external site unless the author has
provided a direct download link, rather than a link to a page
containing download links.

We could then look at uptake on the use of the pull-uploader, and
feedback from package authors, to see whether dropping the remaining
external links and serving everything from PyPI is a viable option.
