using urllib on a more complex site
Adam W.
AWasilenko at gmail.com
Sun Feb 24 20:28:00 EST 2013
On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> >
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need, it's just the navigation buttons. Now I assume they are using some CSS/javascript witchcraft to load all the useful data later, so my question is how do I make urllib "wait" and grab that data as well?
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded. There is no sooner or later. The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (prob. jpeg)
> highlighted,
>
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
> alt="Sex and the City: The Movie (Theatrical)">
>
> if you want to look at that jpeg, you need to download the file url
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers who don't happen to do graphics, for example, the
> ones for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined api for extracting the information
> you want? Check the site, or send a message to the webmaster.
>
> No guarantee that tomorrow, the information won't be buried in some
> javascript fragment. Again, if you want to see that, you might need to
> write a javascript interpreter. it could use any algorithm at all to
> build webpage information, and the encoding could change day by day, or
> hour by hour.
>
> --
> DaveA
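
For reference, reading the 'alt' and 'src' attributes as suggested above could look roughly like the sketch below. It uses only the standard library's html.parser, and the class name ImageCollector is made up for illustration. It is fed the rendered <img> snippet quoted above as a literal string, so it only shows the parsing step and assumes the markup actually contains the tag.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag fed to the parser.

    ImageCollector is a hypothetical helper name, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt")))


# Sample input: the rendered markup quoted in the message above.
sample = (
    '<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" '
    'alt="Sex and the City: The Movie (Theatrical)">'
)

collector = ImageCollector()
collector.feed(sample)

for src, alt in collector.images:
    print(alt, src)
```
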
The problem is that the image URL you found is not in the data urllib returns. To be clear, I know what urllib is supposed to do (i.e. not download image data when loading a page). I've used it many times before; I've just never had to jump through hoops to get at the content I needed.
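
One likely reason the markup is missing: everything after the '#' in that URL is a fragment, and fragments are never sent to the server in an HTTP request. Presumably the page's JavaScript reads the fragment in the browser and loads the matching content afterwards. A small sketch with urllib.parse shows which part of the URL the server can actually see:

```python
from urllib.parse import urldefrag

url = ("http://www.vudu.com/movies/"
       "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")

# urldefrag splits a URL into the part that is requested and the fragment.
base, fragment = urldefrag(url)

print("requested from server:", base)
print("kept by the client:   ", fragment)
```

So however long urllib "waits", the server only ever gets a request for the plain /movies/ page.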
I'll look into figuring out how to find XHR requests in Chrome. I didn't know what that after-the-fact loading was called, so now my searching will be more productive.
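
Once an XHR request shows up in Chrome's Network tab, a possible next step is to request that same URL directly with urllib. The sketch below makes several assumptions: XHR_URL is a placeholder, not the site's real endpoint; the response is assumed to be JSON, which may not hold for this site; and the key name "title" is invented for illustration. The helper names fetch_json and find_values are likewise hypothetical.

```python
import json
import urllib.request

# Placeholder only: substitute the request URL copied from the Network tab.
XHR_URL = "http://example.com/replace-with-the-real-xhr-url"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the body as JSON (assumes a JSON response)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def find_values(node, key):
    """Recursively yield every value stored under `key` in decoded JSON."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Offline demonstration with an invented payload; the real shape is unknown.
sample_payload = {"content": [{"title": "Example Movie", "price": "0.99"}]}
titles = list(find_values(sample_payload, "title"))
print(titles)
```

With a real endpoint, find_values(fetch_json(XHR_URL), "title") would be the call to try, after adjusting the key to whatever the actual response uses.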