[Catalog-sig] homepage/download metadata cleaning

M.-A. Lemburg mal at egenix.com
Fri Mar 1 12:04:24 CET 2013


On 01.03.2013 11:19, holger krekel wrote:
> Hi Richard, all,
> 
> somewhere deep in the threads i mentioned i wrote a little "cleanpypi.py"
> script which takes a project name as an argument and then goes to 
> pypi.python.org and removes all homepage/download metadata entries for 
> this project.  This sanitizes/speeds up installation because
> pip/easy_install don't need to crawl them anymore.  I just did this for
> three of my projects, (pytest, tox and py) and it seems to work fine.

Does it also clean up the links that PyPI adds to the /simple/ index
by parsing the project description for links?

I think those are far nastier than the homepage and download links,
which can be put to some good use to limit the external lookups
(see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal)

See e.g. https://pypi.python.org/simple/zc.buildout/
for a good example of the mess this generates. Even mailto: links
get listed, and "file:///" links expose the installers to all
kinds of nasty things (unless they explicitly protect against
following these).
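
As a rough sketch of the kind of protection meant here, an installer
could refuse to follow any link whose URL scheme is not on a whitelist.
The names SAFE_SCHEMES and is_crawlable, and the sample URLs, are
made up for illustration; this is not how pip or easy_install
actually implement it.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist: only follow links fetched over HTTP(S).
SAFE_SCHEMES = {"http", "https"}


def is_crawlable(url):
    """Return True if the URL uses a scheme an installer may follow."""
    return urlsplit(url).scheme.lower() in SAFE_SCHEMES


# Illustrative links of the kind that can end up on a /simple/ page.
links = [
    "https://pypi.python.org/packages/source/e/example/example-1.0.tar.gz",
    "mailto:someone@example.com",
    "file:///etc/passwd",
    "http://example.com/downloads/",
]

# Keep only the links that pass the scheme check, in original order.
safe_links = [url for url in links if is_crawlable(url)]
```

With this check the mailto: and file:/// entries are dropped before
any crawling happens.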

> Now before i release this as a tool, i wonder: Is it a good idea to remove
> download/homepage entries?  Is there any current machine use (other than
> the dreaded crawling) for the homepage/download_url per-release metadata 
> fields?
> 
> For humans the homepage link is nicely discoverable if the long-description
> doesn't mention it prominently.  But i think there also is a "project url" 
> or "bugtrack url" for a project so maybe those could be used to reference 
> these important pages?  (i am a bit confused on the exact meaning of those
> urls, btw).
> 
> Should we maybe stop advertising "homepage" and "download_url"
> and instead see to extend project-url/bugtrackurl to be used
> and shown nicely? The latter are independent of releases which i think
> makes sense - what use are old probably unreachable/borked homepages
> anyway.  And it's also not too bad having to go once to pypi.python.org
> to set it, as it seldom changes.

I think it would be better to differentiate between showing the
fields on the project pages, where they provide useful resources
for people, and their use on the /simple/ index pages which are
meant for programs to parse.

IMO, the homepage and download links on the project pages are
indeed very useful for people. On the /simple/ index a homepage
link is probably not all that useful (provided a download link
is set). The download links serve the purpose of directing
tools to the right location, so those do belong on the /simple/
index listings. I'd completely remove the links parsed from
the descriptions, since those don't really provide a good
basis for crawling (the description is meant for humans to
parse, not programs).
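
To illustrate the distinction, here is a minimal sketch of a tool
that only picks up links explicitly marked as download links and
ignores everything else on the page. It assumes that the /simple/
page marks such links with rel="download" and that links parsed
from the description carry no rel attribute; the sample HTML and
the names DownloadLinkParser and extract_download_links are
invented for this example.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Collect href values of <a> tags marked with rel="download"."""

    def __init__(self):
        super().__init__()
        self.download_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing.
        rels = (attributes.get("rel") or "").split()
        href = attributes.get("href")
        if href and "download" in rels:
            self.download_links.append(href)


# Made-up page in the assumed /simple/ style: one homepage link,
# one download link, and one link scraped from the description.
SAMPLE_PAGE = """
<html><body>
<a href="http://example.com/" rel="homepage">1.0 home_page</a><br/>
<a href="http://example.com/dist/" rel="download">1.0 download_url</a><br/>
<a href="mailto:dev@example.com">dev@example.com</a><br/>
</body></html>
"""


def extract_download_links(html):
    """Return the download links found in the given HTML text."""
    parser = DownloadLinkParser()
    parser.feed(html)
    return parser.download_links
```

Calling extract_download_links(SAMPLE_PAGE) yields only the link
marked as a download link; the homepage and mailto entries are
never considered for crawling.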

-- 
Marc-Andre Lemburg
eGenix.com
