[Catalog-sig] Migrating away from scanning home pages (was: Deprecate External Links)

M.-A. Lemburg mal at egenix.com
Thu Feb 28 11:55:14 CET 2013

I think we all agree that scanning arbitrary HTML pages
for download links is not a good idea and we need to
transition away from this towards a more reliable system.

Here's an approach that would work to start the transition
while not breaking old tools (sketching here to describe the
basic idea):

Limiting scans to download_url

Installers and similar tools should preferably no longer scan
all the links on the /simple/ index, but instead only look at
the download links (which can be defined in the package
meta data) for packages that don't host files on PyPI.

Going only one level deep

If the download links point to a downloads.html meta-file
(with a hash fragment appended to its URL), the installers
download that file, check whether the hash value matches
and, if it does, scan the file in the same way they would
parse the /simple/ index page of the package - think of the
downloads.html file as a symlink to extend the search to an
external location, but in a predefined and safe way.
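
To make the one-level-deep step concrete, here is a rough sketch of
what an installer might do with such a link. It is only an
illustration: the function names, the "#<algorithm>=<hexdigest>"
fragment layout (borrowed from the style used on the /simple/ index)
and the set of accepted algorithms are assumptions, not part of the
proposal, and fetching the file over the network is left out.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Assumption: which hash algorithms an installer would accept.
ALLOWED_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, as on a /simple/ page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def split_hash_fragment(url):
    """Split 'URL#algo=digest' into (URL, algo, digest)."""
    base, fragment = urldefrag(url)
    algo, sep, digest = fragment.partition("=")
    if not sep or algo not in ALLOWED_ALGORITHMS or not digest:
        raise ValueError("missing or unsupported hash fragment: %r" % url)
    return base, algo, digest.lower()


def links_from_meta_file(download_url, content):
    """Verify the meta-file bytes against the hash in download_url,
    then return the absolute links it lists. Links found in the file
    are returned as-is and never followed as further index pages,
    which keeps the search one level deep."""
    base, algo, expected = split_hash_fragment(download_url)
    actual = hashlib.new(algo, content).hexdigest()
    if actual != expected:
        raise ValueError("hash mismatch for %s" % base)
    collector = _LinkCollector()
    collector.feed(content.decode("utf-8"))
    return [urljoin(base, href) for href in collector.links]
```

The returned links still carry their own hash fragments, so the
distribution files can be checked the same way after download.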


* The creation of the downloads.html file is left to the
  package owner (we could have a tool to easily create it).

* Since the file would use the same format as the PyPI
  /simple/ index directory listing, installers would be
  able to verify the embedded hash values (and later
  GPG signatures) just as they do for files hosted directly
  on PyPI.

* The URL of the downloads.html file, together with the
  hash fragment, would be placed into the setup.py
  download_url variable. This is supported by all recent
  and not so recent Python versions.

* No changes to older Python versions of distutils are
  necessary to make this work, since the download_url
  field is a free form field.

* No changes to existing distutils meta data formats are
  necessary, since the download_url field has always
  been meant for download URLs.

* Installers would not need to learn about a new meta
  data format, because they already know how to parse
  PyPI style index listings.

* Installers would prefer the above approach for downloads,
  and warn users if they have to fall back to the old
  method of scanning all links.

* Installers could impose extra security requirements,
  such as only following HTTPS links and verifying
  all certificates.

* In a later phase of the transition we could have
  PyPI cache the referenced distribution files locally
  to improve reliability. This would turn the push
  strategy for uploading files to PyPI into a pull
  strategy for those packages and make things a lot
  easier to handle for package maintainers.
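
The helper tool for package owners mentioned in the list above could
look roughly like the following. This is a sketch under assumptions:
the function names, the default of sha256, the minimal HTML layout
and the HTTPS check are all made up for illustration and are not
taken from any existing tool.

```python
import hashlib
from html import escape


def build_downloads_page(files, algo="sha256"):
    """Build the bytes of a downloads.html listing.

    files maps each distribution file name to its content (bytes).
    Every link gets a '#<algo>=<hexdigest>' fragment, in the style
    of the links on the /simple/ index.
    """
    rows = []
    for name in sorted(files):
        digest = hashlib.new(algo, files[name]).hexdigest()
        href = escape("%s#%s=%s" % (name, algo, digest), quote=True)
        rows.append('<a href="%s">%s</a><br/>' % (href, escape(name)))
    body = "\n".join(rows)
    return ("<html><body>\n%s\n</body></html>\n" % body).encode("utf-8")


def make_download_url(page_url, page_bytes, algo="sha256"):
    """Return the value to put into download_url in setup.py:
    the page URL plus the hash of the page itself."""
    # Assumption: only HTTPS locations are accepted.
    if not page_url.startswith("https://"):
        raise ValueError("downloads page should be served over HTTPS")
    digest = hashlib.new(algo, page_bytes).hexdigest()
    return "%s#%s=%s" % (page_url, algo, digest)
```

The owner would upload the generated page to their own server and
pass the result of make_download_url() as download_url in setup.py.
Note that the hash in download_url changes whenever the page changes,
so the package meta data has to be updated along with it.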

What do you think?

Marc-Andre Lemburg

