[Catalog-sig] homepage/download metadata cleaning
donald.stufft at gmail.com
Fri Mar 1 12:09:54 CET 2013
On Friday, March 1, 2013 at 6:04 AM, M.-A. Lemburg wrote:
> On 01.03.2013 11:19, holger krekel wrote:
> > Hi Richard, all,
> > somewhere deep in the threads i mentioned i wrote a little "cleanpypi.py"
> > script which takes a project name as an argument and then goes to
> > pypi.python.org and removes all homepage/download metadata entries for
> > this project. This sanitizes/speeds up installation because
> > pip/easy_install don't need to crawl them anymore. I just did this for
> > three of my projects, (pytest, tox and py) and it seems to work fine.
> Does it also clean up the links that PyPI adds to the /simple/ index by
> parsing the project description for links?
> I think those are far nastier than the homepage and download links,
> which can be put to some good use to limit the external lookups
> (see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal)
> See e.g. https://pypi.python.org/simple/zc.buildout/
> for a good example of the mess this generates... even mailto links
> get listed and "file:///" links open up the installers for all
> kinds of nasty things (unless they explicitly protect against
> following these).
pip at least (and, I assume, the other tools) doesn't spider those links, but
it does consider them for download. That is, if a link looks installable it
becomes a candidate for installing, but pip won't fetch it and look for more
links the way it does for download_url/home_page.
I believe that's the way it's structured atm.
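
A rough sketch of that distinction, in case it helps the discussion. This is
not pip's actual code. It assumes the index page marks the homepage and
download links with rel="homepage" / rel="download", that all hrefs are
absolute URLs, and that "looks installable" means the path ends in a known
archive suffix. The names classify_links, ARCHIVE_SUFFIXES and SAFE_SCHEMES
are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: suffixes that make a link "look installable".
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".zip", ".egg")
# Assumption: only these schemes are ever considered, so mailto: and
# file:/// links are dropped outright.
SAFE_SCHEMES = {"http", "https"}


class LinkCollector(HTMLParser):
    """Collect (href, rel) pairs from the anchors of an index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href:
            self.links.append((href, attrs.get("rel") or ""))


def classify_links(html):
    """Split links into download candidates, pages to crawl, and ignored."""
    collector = LinkCollector()
    collector.feed(html)
    candidates, to_crawl, ignored = [], [], []
    for href, rel in collector.links:
        parsed = urlparse(href)
        if parsed.scheme not in SAFE_SCHEMES:
            # mailto:, file:///, relative links, etc.
            ignored.append(href)
        elif parsed.path.lower().endswith(ARCHIVE_SUFFIXES):
            # Looks installable: a candidate, but never crawled further.
            candidates.append(href)
        elif rel in ("homepage", "download"):
            # Only these pages would be fetched and scanned for more links.
            to_crawl.append(href)
        else:
            # Links scraped from the description that are not archives.
            ignored.append(href)
    return candidates, to_crawl, ignored
```

With something like this, a link scraped from the description only matters if
it points straight at an archive, and nothing behind it is ever fetched.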
> > Now before i release this as a tool, i wonder: Is it a good idea to remove
> > download/homepage entries? Is there any current machine use (other than
> > the dreaded crawling) for the homepage/download_url per-release metadata
> > fields?
> > For humans the homepage link is nicely discoverable if the long-description
> > doesn't mention it prominently. But i think there also is a "project url"
> > or "bugtrack url" for a project so maybe those could be used to reference
> > these important pages? (i am a bit confused on the exact meaning of those
> > urls, btw).
> > Should we maybe stop advertising "homepage" and "download_url"
> > and instead see to extend project-url/bugtrackurl to be used
> > and shown nicely? The latter are independent of releases which i think
> > makes sense - what use are old probably unreachable/borked homepages
> > anyway. And it's also not too bad having to go once to pypi.python.org
> > to set it, since it seldom changes.
> I think it would be better to differentiate between showing the
> fields on the project pages, where they provide useful resources
> for people, and their use on the /simple/ index pages which are
> meant for programs to parse.
> IMO, the homepage and download links on the project pages are
> indeed very useful for people. On the /simple/ index a homepage
> link is probably not all that useful (provided a download link
> is set). The download links serve the purpose of directing
> tools to the right location, so those do belong on the /simple/
> index listings. I'd completely remove the links parsed from
> the descriptions, since those don't really provide a good
> basis for crawling (the description is meant for humans to
> parse, not programs).
> Marc-Andre Lemburg