At 04:04 AM 7/5/2006 -0400, Jim Fulton wrote:
On Jun 23, 2006, at 4:51 PM, Jim Fulton wrote: ...
That's a lot of screen scraping. :)
It would be good to capture this as part of the documentation, IMO.
I'm considering adding XML-RPC support to easy_install in 0.7, though. PyPI now has a nice XML-RPC API that is more responsive than the web UI, and it supports case-insensitive partial match searches, making it suitable for easy_install to query when a typed-in name doesn't exactly match the spelling of a PyPI entry.
I think that would be much better.
I just wanted to emphasize that I think this would be a good idea.
Patches welcome. :) Note that there should still be a fallback to the screen scraping code in case of a problem with the XML-RPC API, to allow people to continue using static mirrors of PyPI or imitation PyPIs without needing to support XML-RPC.
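A rough sketch of that fallback, for illustration only. The names find_links, query_xmlrpc and scrape_pages are hypothetical, not existing easy_install functions; the two lookups are passed in as callables so that nothing here assumes a particular XML-RPC method on the index.

```python
import xmlrpc.client


def find_links(name, query_xmlrpc, scrape_pages):
    """Try the XML-RPC lookup first; fall back to page scraping on failure.

    Both callables are hypothetical stand-ins: each takes a project name
    and returns a list of URLs.
    """
    try:
        return query_xmlrpc(name)
    except (OSError, xmlrpc.client.Error):
        # OSError covers network-level failures; xmlrpc.client.Error is the
        # base class of Fault and ProtocolError. A static mirror with no
        # XML-RPC endpoint would end up on this path.
        return scrape_pages(name)
```

Catching only transport and protocol errors keeps genuine bugs in the lookup code visible instead of silently hiding them behind the scraper.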
I was just talking to Richard, and he pointed out that the current approach is a problem for him, because it means he can't evolve the PyPI UI without risking breaking setuptools.
What I would suggest is creating a "microformat" for marking up web pages with sniffable information. For example, adding rel="homepage" and rel="download" to the links that go to those URLs.
In other words, invisible hints on the page to supplement the visible information. Then, I could change easy_install to start using the invisible hints, and drop the visible ones, freeing PyPI to evolve the UI again.
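As a minimal sketch of what reading such hints could look like, here is a parser built on the standard library's html.parser. The rel values "homepage" and "download" are the ones suggested above; the class and function names (RelLinkParser, find_rel_links) are made up for this example and are not part of setuptools.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect (rel, href) pairs from <a> tags carrying a wanted rel value."""

    def __init__(self, wanted=("homepage", "download")):
        super().__init__()
        self.wanted = set(wanted)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        # rel may hold several space-separated tokens, or be missing entirely.
        rels = (attrs.get("rel") or "").lower().split()
        if href:
            for rel in rels:
                if rel in self.wanted:
                    self.links.append((rel, href))


def find_rel_links(html, wanted=("homepage", "download")):
    """Return the marked-up links in document order, ignoring page layout."""
    parser = RelLinkParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.links
```

Because only the rel attribute is inspected, the surrounding tables, headings, and link text can change freely without affecting what gets found.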
While the XML-RPC API would be great, I still want easy_install to be able to use a package index that's made from static files, and that requires some kind of screen scraping. So, let's make it invisible scraping of a documented format, so that anybody can use it, with whatever visual formats they like.
Currently, easy_install gets most of its information from URLs; the only actual scraping of visible data is of the title, the download MD5s, and the table cells that identify links as being to the home page or download URL (since it needs to specifically identify these in order to spider them).
The MD5 information dependency could be removed if PyPI included "#md5=..." at the end of the download URLs; easy_install can see that information and use it. The table cell checking could be removed by adding 'rel="easy_install"' or something like that to the spiderable links.
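To illustrate the "#md5=..." idea, here is a small sketch of pulling the digest out of a URL fragment and checking downloaded bytes against it. The helper names (split_md5_fragment, md5_matches) are hypothetical, and this is not the actual easy_install code.

```python
import hashlib
from urllib.parse import parse_qs, urldefrag


def split_md5_fragment(url):
    """Split 'http://host/pkg.tgz#md5=<hex>' into (plain_url, hex_or_None)."""
    base, fragment = urldefrag(url)
    values = parse_qs(fragment).get("md5")
    return base, (values[0].lower() if values else None)


def md5_matches(data, expected_hex):
    """Return True if the MD5 digest of the bytes equals the expected hex."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()
```

Since the digest travels inside the link itself, a static index only has to emit the right href; no separate visible checksum text needs to be parsed.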
The title checking is used to distinguish pages that list multiple packages from pages that list single packages. I don't have any ready ideas as to how that could or should be represented in a semantic (as opposed to visual) way. Your thoughts?