scraping a tumblr.com archive page

Benjamin Kaplan benjamin.kaplan at case.edu
Sun Nov 20 13:18:21 EST 2011


On Sun, Nov 20, 2011 at 1:06 PM, Jabba Laci <jabba.laci at gmail.com> wrote:
> Hi,
>
> I want to extract the URLs of all the posts on a tumblr blog. Let's
> take for instance this blog: http://loveyourchaos.tumblr.com/archive .
> If I download this page with a script, there are only 50 posts in the
> HTML. If you scroll down in your browser to the end of the archive,
> the browser will dynamically load newer and newer posts.
>
> How can I scrape such a dynamic page?
>
> Thanks,
>
> Laszlo
> --

The page isn't really that dynamic; plain HTTP is request/response, so
the server can't push new content into the page on its own. Scrolling
down the page triggers some JavaScript. That JavaScript sends further
HTTP requests to the server, which returns more HTML that gets
inserted into the middle of the page. If you take the time to monitor
your network traffic using a tool like Firebug, you should be able to
figure out the pattern in the requests for more content. Then just
send those same requests yourself and parse the results.
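
As a rough sketch of that last step, something like the following could
work once you know the request pattern. Everything specific here is an
assumption for illustration, not what tumblr actually does: the
"?page=N" query parameter is a hypothetical stand-in for whatever
pattern you see in Firebug, the "/post/" substring test is a guess at
how post links can be recognised, and the names PostLinkParser,
extract_post_links, fetch_html and scrape_archive are made up. It uses
only the standard library (Python 3).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PostLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like post links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # ASSUMPTION: post URLs contain "/post/"; adjust to the real markup.
        if href and "/post/" in href:
            self.links.append(href)


def extract_post_links(html, base_url):
    """Return absolute post URLs found in one chunk of HTML."""
    parser = PostLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch_html(url):
    """Download a URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_archive(base_url, fetch=fetch_html, max_pages=100):
    """Request successive chunks until one yields no new post links."""
    seen = []
    for page in range(1, max_pages + 1):
        # HYPOTHETICAL pattern: replace with the requests you observed.
        url = "{}?page={}".format(base_url, page)
        links = extract_post_links(fetch(url), base_url)
        new_links = [link for link in links if link not in seen]
        if not new_links:
            break
        seen.extend(new_links)
    return seen
```

Calling scrape_archive("http://loveyourchaos.tumblr.com/archive") would
then return the list of post URLs, provided the URL pattern and the
link test are changed to match the real requests. The fetch argument is
there so the loop can be tried out against canned HTML without touching
the network.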


