[Catalog-sig] [Distutils] Specification for package indexes?

Phillip J. Eby pje at telecommunity.com
Wed Jul 5 17:36:03 CEST 2006


At 04:04 AM 7/5/2006 -0400, Jim Fulton wrote:

>On Jun 23, 2006, at 4:51 PM, Jim Fulton wrote:
>...
>>
>>That's a lot of screen scraping. :)
>>
>>It would be good to capture this as part of the documentation IMO
>>
>>>I'm considering adding XML-RPC support to easy_install in 0.7,
>>>though.  PyPI now has a nice XML-RPC API that is more responsive
>>>than the web UI, and it supports case-insensitive partial match
>>>searches, making it suitable for easy_install to query when a typed-
>>>in name doesn't exactly match the spelling of a PyPI entry.
>>
>>I think that would be much better.
>
>I just wanted to emphasize that I think this would be a good
>idea.

Patches welcome.  :)  Note that there should still be a fallback to the 
screen scraping code in case XML-RPC fails or is unavailable, so that people 
can continue using static mirrors of PyPI or imitation PyPIs without needing 
to support XML-RPC.
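
A rough sketch of that fallback, in Python. The names lookup_with_fallback, 
query_rpc and scrape_index are hypothetical, not existing easy_install 
functions; the two callables stand in for whatever XML-RPC query and 
page-scraping code is actually used.

```python
import xmlrpc.client


def lookup_with_fallback(name, query_rpc, scrape_index):
    """Try the XML-RPC lookup first; fall back to scraping on failure.

    query_rpc and scrape_index are hypothetical callables that each take
    a project name and return whatever the caller treats as a result.
    """
    try:
        return query_rpc(name)
    except (xmlrpc.client.Error, OSError):
        # xmlrpc.client.Error covers Fault and ProtocolError; OSError
        # covers connection problems, e.g. a static mirror that has no
        # XML-RPC endpoint at all.
        return scrape_index(name)
```

With this shape, an index made of static files only needs scrape_index to 
work; the XML-RPC path is an optimization, not a requirement.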


>   I was just talking to Richard, and he pointed out that the
>current approach is a problem for him, because it means he can't
>evolve the pypi UI without risking breaking setuptools.

What I would suggest is creating a "microformat" for marking up web pages 
with sniffable information.  For example, adding rel="homepage" and 
rel="download" to the links that go to those URLs.

In other words, invisible hints on the page to supplement the visible 
information.  Then, I could change easy_install to start using the 
invisible hints, and drop the visible ones, freeing PyPI to evolve the UI 
again.
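
To make the idea concrete, here is a minimal sketch of how a client could 
pick up such hints using only the Python standard library. RelLinkParser and 
find_rel_links are illustrative names, and the rel values are just the 
"homepage" and "download" examples from above, not an agreed format.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect href values of <a> tags, grouped by rel token."""

    def __init__(self, wanted=("homepage", "download")):
        super().__init__()
        self.wanted = set(wanted)
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens.
        for token in (attrs.get("rel") or "").lower().split():
            if token in self.wanted:
                self.links.setdefault(token, []).append(href)


def find_rel_links(html, wanted=("homepage", "download")):
    """Return a dict mapping each wanted rel token to its list of URLs."""
    parser = RelLinkParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.links
```

Nothing here depends on tables, headings, or any other visible layout, so 
the page around the links can change freely.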

While the XML-RPC API would be great, I still want easy_install to be able 
to use a package index that's made from static files, and that requires 
some kind of screen scraping.  So, let's make it invisible scraping of a 
documented format, so that anybody can use it, with whatever visual formats 
they like.

Currently, easy_install gets most of its information from URLs; the only 
actual scraping of visible data is of the title, the download MD5s, and 
the table cells that identify links as pointing to the home page or download 
URL (since it needs to identify these specifically in order to spider them).

The MD5 information dependency could be removed if PyPI included "#md5=..." 
at the end of the download URLs; easy_install can see that information and 
use it.  The table cell checking could be removed by adding 
'rel="easy_install"' or something like that to the spiderable links.
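
As an illustration of the "#md5=..." idea, a client could split the checksum 
off the URL fragment and compare it with the downloaded bytes. This is only a 
sketch; split_md5_fragment and md5_matches are made-up helper names, not 
easy_install's own code.

```python
import hashlib
from urllib.parse import urldefrag


def split_md5_fragment(url):
    """Return (url_without_fragment, md5_hex) for a '#md5=...' URL.

    If the URL carries no md5 fragment, it is returned unchanged with
    None in place of the digest.
    """
    base, fragment = urldefrag(url)
    if fragment.startswith("md5="):
        return base, fragment[len("md5="):].lower()
    return url, None


def md5_matches(data, expected_hex):
    """Check downloaded bytes against the digest taken from the URL."""
    return hashlib.md5(data).hexdigest() == expected_hex
```

Because the digest travels with the link itself, it works equally well on a 
static mirror and needs no visible text on the page.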

The title checking is used to distinguish pages that list multiple packages 
from pages that list single packages.  I don't have any ready ideas as to 
how that could or should be represented in a semantic (as opposed to 
visual) way.  Your thoughts?


