Phillip J. Eby wrote:
I've incorporated more or less the same patch, after an unsuccessful experiment trying to pull the 'refresh:' from the HTTP headers. (Firefox showed it in the "View Page Info", so I thought it was there, but in fact it is only in the HTML. Oh well.)
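For illustration, pulling the refresh target out of the HTML rather than the HTTP headers might look roughly like the sketch below. It is not the patch that was applied. It assumes the page carries a `<meta http-equiv="refresh" content="N; url=...">` tag with the attributes in that order, and `META_REFRESH` and `find_refresh_url` are hypothetical names used only here.

```python
import re

# Assumption: the refresh lives in a <meta http-equiv="refresh"> tag whose
# content attribute looks like "5; url=http://...". Attribute order and
# quoting vary in real pages, so this pattern is deliberately narrow.
META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']?refresh["\']?\s+'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)

def find_refresh_url(html):
    """Return the URL named in a meta refresh tag, or None if there is none."""
    match = META_REFRESH.search(html)
    return match.group(1) if match else None
```

A page without such a tag simply yields None, so the caller can fall back to its normal link scanning.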
So, have you changed your opinion at all about the value of screenscraping as a way to build a Python package tool? I notice you've sent two patches today that apply regexes to HTML. :)
Yeah... now you're making me second-guess myself ;) I dunno... I just want stuff to work, and if a better solution comes along later that's fine too. I've always expected special code just for SF, since they are a big-and-annoying source of downloads. The whole thing tends to be stupid for Python code anyway, which usually isn't large enough to justify the complexity of mirroring systems (for example the zpt package is 35 KB, and I'm sure the mirroring system takes far more resources to provide). With PyPI getting file hosting, hopefully this kind of thing will go away -- which is why I don't think a general solution (outside of SF) is necessary, because SF is an anachronistic style of distribution.
I am a *little* concerned about the sourceforge support, given that they could change their download system any time, and if easy_install is distributed with Python that might make it harder to upgrade. But, at least people have the option of subclassing.
Yeah, I thought about that too. In practice SF doesn't change much. A better set of regexes might look for a hostname of prdownloads.(sf|sourceforge).net, and then any href=(".*?\?use_mirror=[^"]*"|.*?\?use_mirror=[^ >]*), both case insensitive, which is probably a little less fragile. There's a good chance that if they ever change it they'll provide documented APIs, since I'm sure there are a lot of screen scrapers similar to this one out there.

-- Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
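A rough Python sketch of the two regexes described above, not the code that went into easy_install. The names `SF_HOST`, `MIRROR_HREF`, `is_sf_download` and `find_mirror_links` are hypothetical, and the quoted branch uses `[^"]*?` instead of `.*?` so that a match cannot run across a closing quote, which is a small departure from the pattern as written in the message.

```python
import re
from urllib.parse import urlparse

# Hostname check: prdownloads.sf.net or prdownloads.sourceforge.net.
SF_HOST = re.compile(r'^prdownloads\.(sf|sourceforge)\.net$', re.IGNORECASE)

# href values carrying ?use_mirror=..., either double-quoted or bare.
MIRROR_HREF = re.compile(
    r'href=(?:"([^"]*?\?use_mirror=[^"]*)"'
    r'|([^\s">]*?\?use_mirror=[^\s>]*))',
    re.IGNORECASE,
)

def is_sf_download(url):
    """True if the URL's host is one of the SF download hostnames."""
    host = urlparse(url).hostname or ''
    return bool(SF_HOST.match(host))

def find_mirror_links(html):
    """Return every href value in the page that contains ?use_mirror=."""
    # Each findall result is (quoted, bare); exactly one of them is non-empty.
    return [quoted or bare for quoted, bare in MIRROR_HREF.findall(html)]
```

Only links that pass both checks would be followed, so an ordinary page that merely mentions SF is left alone.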