Retrieve URLs of all JPEGs at a web page URL
Stefan Behnel
stefan_ml at behnel.de
Wed Sep 16 00:32:22 EDT 2009
Chris Rebert wrote:
> import urllib
> from BeautifulSoup import BeautifulSoup
>
> page_url = "http://the.url.here"
>
> f = urllib.urlopen(page_url)
> soup = BeautifulSoup(f.read())
> f.close()
> for img_tag in soup.findAll("img"):
>     relative_url = img_tag["src"]
>     img_url = make_absolute(relative_url, page_url)
>     save_image_from_url(img_url)
>
> 2. Write make_absolute() and save_image_from_url()
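For completeness, the two helpers named in step 2 might look something like this (untested sketch; `urljoin` does the real work, and the Python 3 import locations are included only as a fallback — the helper names come from the quoted pseudocode):

```python
import posixpath

try:
    from urlparse import urljoin, urlsplit   # Python 2
    from urllib import urlretrieve
except ImportError:
    from urllib.parse import urljoin, urlsplit   # Python 3
    from urllib.request import urlretrieve

def make_absolute(relative_url, page_url):
    """Resolve a (possibly relative) URL against the page it came from."""
    return urljoin(page_url, relative_url)

def save_image_from_url(img_url):
    """Download img_url to a local file named after the last path segment."""
    filename = posixpath.basename(urlsplit(img_url).path) or 'image.jpg'
    urlretrieve(img_url, filename)
```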
Note that lxml.html elements provide a make_links_absolute() method.
Also untested:
from lxml import html

doc = html.parse(page_url).getroot()
doc.make_links_absolute(page_url)
urls = [img.get('src') for img in doc.xpath('//img')]
Then use e.g. urllib2 to save the images.
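Since the subject line asks for JPEGs specifically, the collected urls list can be narrowed down first (untested sketch; jpeg_urls is a hypothetical helper, not part of lxml, and it only checks the URL's path component so query strings don't get in the way):

```python
try:
    from urlparse import urlsplit        # Python 2
except ImportError:
    from urllib.parse import urlsplit    # Python 3

def jpeg_urls(urls):
    """Keep only URLs whose path component ends in .jpg or .jpeg."""
    return [u for u in urls
            if urlsplit(u).path.lower().endswith(('.jpg', '.jpeg'))]
```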
Stefan