[Tutor] Unable to download <th>, <td> using Beautifulsoup

Alan Gauld alan.gauld at yahoo.co.uk
Fri Jul 29 18:59:53 EDT 2016


On 29/07/16 23:10, bruce wrote:

> The most "complete" is the use of a headless browser. However, the
> use/implementation of a headless browser has its own share of issues.
> Speed, complexity, etc...

Walter and Bruce have jumped ahead a few steps from where I was
heading, but basically this is an increasingly common scenario: web
pages are no longer primarily HTML but are instead JavaScript
programs that fetch their data dynamically. That is why the <th>
and <td> cells may never appear in the HTML that Beautifulsoup sees.

A headless browser is the brute-force way to deal with such issues,
but a better (purer?) way is to access the same API that the browser
is using. Many web sites now publish RESTful web service APIs
that you can call directly. It is worth investigating
whether your target has one. If so, that will generally provide
a much nicer solution than trying to drive a headless browser.
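
As a rough sketch of what calling such an API directly might look
like, here is a minimal example using only the standard library.
Everything specific in it is an assumption for illustration: the
URL, the "page" parameter and the shape of the JSON payload are all
hypothetical. You would find the real ones by watching the requests
your browser makes (the network tab of its developer tools) or by
reading the site's API documentation, if it has any.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only.
API_URL = "https://example.com/api/items"


def build_request(base_url, params):
    """Build a GET request that asks the server for JSON."""
    url = base_url + "?" + urlencode(params)
    return Request(url, headers={"Accept": "application/json"})


def parse_rows(payload):
    """Convert JSON text into a list of (name, value) tuples.

    Assumes a payload shaped like
    {"rows": [{"name": ..., "value": ...}, ...]},
    which is an invented format; adapt it to the real response.
    """
    data = json.loads(payload)
    return [(row["name"], row["value"]) for row in data["rows"]]


def fetch_rows(base_url, params):
    """Fetch the data over the network and parse it."""
    with urlopen(build_request(base_url, params), timeout=10) as resp:
        return parse_rows(resp.read().decode("utf-8"))
```

Calling fetch_rows(API_URL, {"page": 1}) would then give you the
data as ordinary Python tuples, with no HTML parsing involved.
Keeping the parsing separate from the network call also means you
can try parse_rows() on a saved response before going near the
live site.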

Finally, you need to consider whether you have the right to the
data without running a browser. Many sites provide information
for free but get paid by adverts. If you bypass the web screen
(adverts) you bypass their revenue, and many do not allow that.
So you need to be sure that you are legally entitled to scrape
data from the site or use an API.

Otherwise you may be on the wrong end of a lawsuit or, at
best, be contributing to the demise of the very site you are
trying to use.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



