[Catalog-sig] [Distutils] Specification for package indexes?
Phillip J. Eby
pje at telecommunity.com
Wed Jul 5 17:36:03 CEST 2006
At 04:04 AM 7/5/2006 -0400, Jim Fulton wrote:
>On Jun 23, 2006, at 4:51 PM, Jim Fulton wrote:
>...
>>
>>That's a lot of screen scraping. :)
>>
>>It would be good to capture this as part of the documentation IMO
>>
>>>I'm considering adding XML-RPC support to easy_install in 0.7,
>>>though. PyPI now has a nice XML-RPC API that is more responsive
>>>than the web UI, and it supports case-insensitive partial match
>>>searches, making it suitable for easy_install to query when a typed-
>>>in name doesn't exactly match the spelling of a PyPI entry.
>>
>>I think that would be much better.
>
>I just wanted to emphasize that I think this would be a good
>idea.
Patches welcome. :) Note that there should still be a fallback to the
screen scraping code in case of a problem with the XML-RPC API, to allow people
to continue using static mirrors of PyPI or imitation PyPIs without needing
to support XML-RPC.
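
A minimal sketch of that fallback, in present-day Python rather than anything
easy_install actually contains. The names query_with_fallback, rpc_query and
scrape_query are hypothetical; the two callables stand in for "ask the index
over XML-RPC" and "scrape the index pages", whatever those end up looking like.

```python
import xmlrpc.client


def query_with_fallback(name, rpc_query, scrape_query):
    """Try the XML-RPC lookup first; fall back to scraping if it fails.

    rpc_query and scrape_query are hypothetical callables that take a
    project name and return whatever the caller treats as a result.
    """
    try:
        return rpc_query(name)
    except (xmlrpc.client.Error, OSError):
        # xmlrpc.client.Error covers Fault and ProtocolError; OSError covers
        # connection failures. A static mirror with no XML-RPC endpoint
        # would land here and still be usable through scraping.
        return scrape_query(name)
```

Catching only transport and protocol errors keeps programming mistakes in the
callables visible instead of silently routing them to the scraper.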
> I was just talking to Richard, and he pointed out that the
>current approach is a problem for him, because it means he can't
>evolve the pypi UI without risking breaking setuptools.
What I would suggest is creating a "microformat" for marking up web pages
with sniffable information. For example, adding rel="homepage" and
rel="download" to the links that go to those URLs.
In other words, invisible hints on the page to supplement the visible
information. Then, I could change easy_install to start using the
invisible hints, and drop the visible ones, freeing PyPI to evolve the UI
again.
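
To make the idea concrete, here is a rough sketch of how a client could read
such hints, written in present-day Python with the standard html.parser
module. It is not easy_install's code; RelLinkParser, find_rel_links and the
example.com URLs are made up for illustration, and the rel values are just the
two suggested above.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect href values of <a> tags, grouped by rel token."""

    def __init__(self):
        super().__init__()
        self.links = {}  # rel token -> list of hrefs, in page order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        rel = attrs.get("rel")
        if not href or not rel:
            return
        # rel may carry several space-separated tokens
        for token in rel.lower().split():
            self.links.setdefault(token, []).append(href)


def find_rel_links(html):
    """Return {rel token: [href, ...]} for every marked-up link in html."""
    parser = RelLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment; the visible text can be anything.
page = (
    '<a rel="homepage" href="http://example.com/foo/">Home Page</a>\n'
    '<a rel="download" href="http://example.com/foo/foo-1.0.tar.gz">Get it</a>\n'
    '<a href="http://example.com/other">unrelated</a>\n'
)
links = find_rel_links(page)
```

Nothing here depends on table cells or link text, so the visible layout could
change freely as long as the rel attributes stay put.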
While the XML-RPC API would be great, I still want easy_install to be able
to use a package index that's made from static files, and that requires
some kind of screen scraping. So, let's make it invisible scraping of a
documented format, so that anybody can use it, with whatever visual formats
they like.
Currently, easy_install gets most of its information from URLs; the only
actual scraping of visible data is of the title, the download MD5s, and
the table cells that identify links as being to the home page or download
URL (since it needs to specifically identify these in order to spider them).
The MD5 information dependency could be removed if PyPI included "#md5=..."
at the end of the download URLs; easy_install can see that information and
use it. The table cell checking could be removed by adding
'rel="easy_install"' or something like that to the spiderable links.
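
As an illustration of the "#md5=..." convention, the sketch below pulls the
digest out of a URL fragment and checks downloaded bytes against it. It uses
present-day Python's urllib.parse and hashlib; split_md5 and md5_matches are
hypothetical helper names, not easy_install functions, and the URL is invented.

```python
import hashlib
from urllib.parse import parse_qs, urldefrag


def split_md5(url):
    """Return (url_without_fragment, md5_hex_or_None)."""
    base, fragment = urldefrag(url)
    digests = parse_qs(fragment).get("md5")
    return base, (digests[0].lower() if digests else None)


def md5_matches(data, expected):
    """True if the MD5 of data (bytes) equals the expected hex digest."""
    return hashlib.md5(data).hexdigest() == expected


# Hypothetical download link as it might appear on an index page.
payload = b"example payload"
url = "http://example.com/foo-1.0.tar.gz#md5=" + hashlib.md5(payload).hexdigest()

base, expected = split_md5(url)
ok = expected is not None and md5_matches(payload, expected)
```

Because the digest travels inside the link itself, a static index needs no
extra visible markup to carry it.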
The title checking is used to distinguish pages that list multiple packages
from pages that list single packages. I don't have any ready ideas as to
how that could or should be represented in a semantic (as opposed to
visual) way. Your thoughts?