[Tutor] python internet archive API?

Luke Paireepinart rabidpoobear at gmail.com
Thu Apr 26 08:39:47 CEST 2007


Switanek, Nick wrote:
>
> I’m a novice Python programmer, and I’ve been looking for a way to 
> collect archived web pages. I would like to use the data on Internet 
> Archive, via the “Wayback Machine”. Look, for example, at 
> http://web.archive.org/web/*/http://www.python.org. I’d like to crawl 
> down the first few levels of links of each of the updated archived 
> pages (the ones with *’s next to them). The site’s robots.txt 
> exclusions are complete, so a screen-scraping strategy doesn’t seem 
> doable.
>
What does the robots.txt have to do with anything?
Just ignore it.
If the robots.txt is telling you not to do something, you know that they 
don't want you to do it.
But if you have a valid reason, just do it anyway.
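
If you want to see exactly what a robots.txt asks of you before deciding,
here is a minimal sketch using the standard library's urllib.robotparser
(that is the Python 3 name; in Python 2 the module is called robotparser).
The rules text and the "my-crawler" user agent below are made up for
illustration and are not the Archive's actual file; allowed() is just a
hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT the real file served by web.archive.org.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /web/
"""


def allowed(rules_text, user_agent, url):
    """Return True if rules_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://web.archive.org/web/2007/page.html",
        "http://web.archive.org/about/",
    ):
        print(url, allowed(EXAMPLE_RULES, "my-crawler", url))
```

To check a live site instead, RobotFileParser also has set_url() and
read(), which download the file for you. Either way the parser only
reports what the file says; nothing in it stops your script from making
the request.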
>
> Does anyone have any suggestions for a way to go about this pythonically?
>
> Many thanks,
>
> Nick


