[Catalog-sig] homepage/download metadata cleaning

M.-A. Lemburg mal at egenix.com
Fri Mar 1 20:31:28 CET 2013


On 01.03.2013 12:17, holger krekel wrote:
> On Fri, Mar 01, 2013 at 06:09 -0500, Donald Stufft wrote:
>> On Friday, March 1, 2013 at 6:04 AM, M.-A. Lemburg wrote:
>>> On 01.03.2013 11:19, holger krekel wrote:
>>>> Hi Richard, all,
>>>>
>>>> somewhere deep in the threads i mentioned i wrote a little "cleanpypi.py"
>>>> script which takes a project name as an argument and then goes to 
>>>> pypi.python.org and removes all homepage/download metadata entries for 
>>>> this project. This sanitizes/speeds up installation because
>>>> pip/easy_install don't need to crawl them anymore. I just did this for
>>>> three of my projects, (pytest, tox and py) and it seems to work fine.
>>>>
>>>
>>>
>>> Does it also clean up the links that PyPI adds to the /simple/ pages by
>>> parsing the project description for links ?
>>>
>>> I think those are far nastier than the homepage and download links,
>>> which can be put to some good use to limit the external lookups
>>> (see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal)
>>>
>>> See e.g. https://pypi.python.org/simple/zc.buildout/
>>> for a good example of the mess this generates... even mailto links
>>> get listed and "file:///" links open up the installers for all
>>> kinds of nasty things (unless they explicitly protect against
>>> following these).
>>>
>>>
>>
>> pip at least, and I assume the other tools, don't spider those links, but
>> they do consider them for download (e.g. if the link looks installable
>> it will be a candidate for installing, but it won't be fetched and searched
>> for more links the way download_url/home_page are).
>>
>> I believe that's the way it's structured atm.
> 
> That's right. Even though the long-description extracted links 
> look ugly on a simple/PKGNAME page, neither pip nor easy_install do anything
> with them except if the "href" ends in "#egg=PKGNAME-" in which case they are
> taken as pointing to a development tarball (e.g. at github or bitbucket).
> AFAIK a link like "PKGNAME-VER.tar.gz" will not be treated as
> an installation candidate, just the "#egg=PKGNAME" one.

Hmm, then why not remove links that don't match the above from
the /simple/ index pages ?

Note that it's easily possible to make e.g. file:/// links
have a fragment that matches what you described, so I guess the
filters would have to be more careful about what to allow
(e.g. only http/ftp schemes, perhaps even only https schemes)
and what not.
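
A minimal sketch of what such a filter might look like, assuming the
index page generator has the list of hrefs extracted from the description
and the project name at hand. The names ALLOWED_SCHEMES, is_allowed_link
and filter_links are made up for illustration and are not part of PyPI,
pip or easy_install; the allowed scheme set and the exact fragment rule
are assumptions as well:

```python
from urllib.parse import urlsplit

# Assumption: which schemes to trust is a policy decision; this could be
# narrowed to {"https"} only.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_allowed_link(href, project):
    """Return True if href uses an allowed scheme and carries an
    "#egg=<project>" or "#egg=<project>-<something>" fragment."""
    parts = urlsplit(href)
    # Check the scheme first, so file:, mailto: and javascript: links are
    # rejected even if their fragment looks right.
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    fragment = parts.fragment.lower()
    wanted = "egg=" + project.lower()
    return fragment == wanted or fragment.startswith(wanted + "-")


def filter_links(hrefs, project):
    """Keep only the links that pass is_allowed_link()."""
    return [href for href in hrefs if is_allowed_link(href, project)]
```

With this, "file:///tmp/evil#egg=PKGNAME-dev" is dropped because of its
scheme, and "http://example.com/PKGNAME-1.0.tar.gz" is dropped because it
has no matching fragment.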

BTW: Are those links also shown as-is on the description page ?
People could do nasty stuff by adding "javascript:" links which look
like normal links to the descriptions.

-- 
Marc-Andre Lemburg
eGenix.com


