[Catalog-sig] Migrating away from scanning home pages (was: Deprecate External Links)

M.-A. Lemburg mal at egenix.com
Thu Feb 28 11:55:14 CET 2013


I think we all agree that scanning arbitrary HTML pages
for download links is not a good idea and we need to
transition away from this towards a more reliable system.

Here's an approach that would work to start the transition
while not breaking old tools (sketching here to describe the
basic idea):

Limiting scans to download_url
------------------------------

Installers and similar tools would preferably no longer scan all
the links on the /simple/ index page, but instead only look at
the download links (which can be defined in the package
meta data) for packages that don't host their files on PyPI.
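As a sketch of the installer side of this: assuming the rel="download"
convention PyPI already uses to mark download links on /simple/ pages,
an installer could filter to just those links (the class name below is
made up for illustration):

```python
from html.parser import HTMLParser

class DownloadLinkFilter(HTMLParser):
    """Collect only <a rel="download"> links from a /simple/ index
    page, ignoring home page and other scraped links.
    (Illustrative sketch; the class name is invented.)"""

    def __init__(self):
        super().__init__()
        self.download_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Keep only links explicitly marked as downloads
        if attrs.get("rel") == "download" and "href" in attrs:
            self.download_links.append(attrs["href"])
```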

Going only one level deep
-------------------------

If a download link points to a meta-file named
"<packagename>-<version>-downloads.html#<sha256-hashvalue>",
the installer downloads that file, checks whether the
hash value matches and, if it does, scans the file in
the same way it would parse the package's /simple/ index
page - think of the downloads.html file as a symlink that
extends the search to an external location, but in a
predefined and safe way.
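The verification step could look roughly like this (a minimal sketch;
the function name is invented, and I'm assuming the fragment is the
bare sha256 hex digest as in the file-name scheme above):

```python
import hashlib
from urllib.parse import urldefrag

def verify_downloads_page(url_with_fragment, page_bytes):
    """Return True if the fetched downloads.html bytes match the
    sha256 hash embedded in the URL fragment.
    (Sketch only; fetching page_bytes is left to the installer.)"""
    _, expected = urldefrag(url_with_fragment)
    actual = hashlib.sha256(page_bytes).hexdigest()
    return actual == expected.lower()
```

Only if the hash matches would the installer go on to parse the file
as an index page; a mismatch means the file was tampered with or is
stale, and the link should be ignored.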

Comments
--------

* The creation of the downloads.html file is left to the
  package owner (we could have a tool to easily create it).

* Since the file would use the same format as the PyPI
  /simple/ index directory listing, installers would be
  able to verify the embedded hash values (and later
  GPG signatures) just as they do for files hosted directly
  on PyPI.

* The URL of the downloads.html file, together with the
  hash fragment, would be placed into the setup.py
  download_url variable. This is supported by all recent
  and not so recent Python versions.

* No changes to distutils in older Python versions are
  necessary to make this work, since the download_url
  field is a free-form field.

* No changes to existing distutils meta data formats are
  necessary, since the download_url field has always
  been meant for download URLs.

* Installers would not need to learn about a new meta
  data format, because they already know how to parse
  PyPI style index listings.

* Installers would prefer the above approach for downloads,
  and warn users if they have to fall back to the old
  method of scanning all links.

* Installers could impose extra security requirements,
  such as only following HTTPS links and verifying
  all certificates.

* In a later phase of the transition we could have
  PyPI cache the referenced distribution files locally
  to improve reliability. This would turn the push
  strategy for uploading files to PyPI into a pull
  strategy for those packages and make things a lot
  easier to handle for package maintainers.
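On the package-owner side, producing the download_url value is just a
hash computation over the downloads.html file; a minimal sketch
(helper name invented):

```python
import hashlib

def make_download_url(page_url, page_bytes):
    """Build the setup.py download_url value: the URL of the
    downloads.html file plus its sha256 digest as fragment.
    (Hypothetical helper matching the naming scheme sketched above.)"""
    return page_url + "#" + hashlib.sha256(page_bytes).hexdigest()
```

The result would then be passed as the download_url argument to
setup() and regenerated whenever the downloads.html file changes.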

What do you think?

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 28 2013)
>>> Python Projects, Consulting and Support ...   http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ...       http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::::: Try our mxODBC.Connect Python Database Interface for free ! ::::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/

