using urllib on a more complex site
davea at davea.name
Mon Feb 25 01:30:00 CET 2013
On 02/24/2013 07:02 PM, Adam W. wrote:
> I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> in order to send myself an email every day of the 99c movie of the day.
> However, using a simple command like (in Python 3.0):
The CSS, the JPEGs, and many other parts of a web "page" are loaded separately by the browser as it parses the tags of the page you downloaded. There is no "sooner" or "later": the website won't send the other files until you request them.
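To make that concrete, here's a minimal sketch of fetching just the HTML with urllib in Python 3 (the User-Agent header is my own addition; some sites refuse urllib's default one, and the helper names are just for illustration):

```python
from urllib.request import Request, urlopen

def build_request(url):
    # Some sites reject urllib's default User-Agent, so send a
    # browser-like one instead.
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})

def fetch_html(url):
    # Returns only the HTML document itself -- none of the images,
    # CSS, or scripts it references.
    with urlopen(build_request(url)) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Every img, link, and script URL in that HTML would then need its own fetch_html-style request if you actually wanted those files.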
For example, that site at the moment has one image (probably a JPEG):
<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
alt="Sex and the City: The Movie (Theatrical)">
If you want to look at that JPEG, you need to download the file at the URL
given by the src attribute of that img element.
Or perhaps you can just look at the 'alt' attribute, which is mainly
there for browsers that don't render graphics, for example screen
readers for the blind.
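One way to pull the src and alt attributes out of the HTML with nothing but the standard library is html.parser; a sketch using the tag quoted above (the class name ImgCollector is just an illustration):

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect the src and alt attributes of every <img> tag seen."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            d = dict(attrs)
            self.images.append((d.get("src"), d.get("alt")))

# The img element quoted from the page, as a test input.
html = ('<img class="gwt-Image" '
        'src="http://images2.vudu.com/poster2/179186-m" '
        'alt="Sex and the City: The Movie (Theatrical)">')

parser = ImgCollector()
parser.feed(html)
print(parser.images)
```

Running this on the full page instead of one tag would give you every image's URL and alt text, and you'd still have to decide which one is the 99c movie.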
Naturally, there may be dozens of images on the page, and there's no
guarantee that the website author is trying to make it easy for you.
Why not check whether there's a documented API for extracting the information
you want? Check the site, or send a message to the webmaster.
There's no guarantee that tomorrow the information won't be buried in some
dynamically built page structure (that gwt-Image class suggests the page is
assembled by JavaScript), and the markup could change day by day, or
hour by hour.