[Tutor] Unable to download <th>, <td> using Beautifulsoup

Walter Prins wprins at gmail.com
Fri Jul 29 17:23:54 EDT 2016


Hi

On 29 July 2016 at 08:28, Crusier <crusier at gmail.com> wrote:

> I am using Python 3 on Windows 7.
>


> When I use Google Chrome and use 'View Page Source', the data does not
> show up at all. However, when I use 'Inspect', I am able to read the
> data.
>
> Please kindly explain whether the data is hidden in a CSS style sheet,
> or whether there is any way to retrieve the data listed.
>

Using Inspect is not the same as View Source. The inspect tools give you
the DOM tree as it currently is in the browser. That tree may have been
modified by any number of things (most likely Javascript) since the initial
page source was loaded.  It is likely that the data you're trying to get
at is fetched dynamically after the initial page source is loaded, which
is why you don't see it when using "view source."

As an experiment, you can temporarily disable your browser's Javascript
engine and reload the webpage.  If this is indeed what's happening, you
should then find that you can't see the data you're after at all, even
with Inspect.  (To do this in Chrome, see:
https://productforums.google.com/forum/#!topic/chrome/BYOQskiuGU0)

So, if this is what's going on, then you have a bit of a problem.  The
Python "requests" module is obviously not a full browser and does not
include a Javascript runtime, so by itself it cannot yield the same result
as a real browser if the page content is in fact populated dynamically by
Javascript after the initial HTML page source is loaded.
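
For example, a plain requests + BeautifulSoup fetch (a rough, untested
sketch below; the URL is just a placeholder) only ever sees that initial
HTML, so any <td> cells injected later by Javascript simply won't be in
the soup:

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- substitute the page you're actually scraping.
url = "http://example.com/quotes"

html = requests.get(url).text              # only the initial page source
soup = BeautifulSoup(html, "html.parser")

# Any <td> cells that Javascript adds later are not in this HTML at all.
print(soup.find_all("td"))                 # quite possibly prints []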

To get around this you fundamentally need a browser of some kind that you
can control, one that includes a Javascript runtime able to process and
construct the DOM (and render the page, if you so desire) before you
retrieve the data you're after.

It should be possible to do this; there are projects and questions on the
internet about it.  Firstly, there's a project named "Selenium" that
provides a way of automating various browsers and has Python bindings (I
used it some years ago).  So you could conceivably use Python + Selenium +
(Chrome or Firefox, say) to fetch the page and then extract the data.  The
disadvantage is that there will be a real browser and browser window
floating around.
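
In rough outline it would look something like the following (an untested
sketch; the URL and the five second sleep are just placeholders, and in
real code you'd use Selenium's explicit wait facilities instead):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://example.com/quotes"          # placeholder URL

driver = webdriver.Firefox()               # or webdriver.Chrome()
driver.get(url)
time.sleep(5)                              # crude wait for the Javascript to run

# page_source now reflects the DOM *after* Javascript has modified it
soup = BeautifulSoup(driver.page_source, "html.parser")
for cell in soup.find_all("td"):
    print(cell.get_text(strip=True))

driver.quit()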

A slightly better alternative would be to use a "headless" (displayless)
browser, such as PhantomJS.  It is basically a browser engine with lots of
ways to control and automate it.  To my knowledge it does not include
Python bindings directly, but Selenium includes a PhantomJS driver (I
think).  There are lighter-weight options like "jsdom" and "slimerjs", but
I have no idea whether these would suffice, or whether they have Python
wrappers.
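
If the PhantomJS driver works the way I think it does, the script is
almost identical to the visible-browser version above, just with no
window appearing (again an untested sketch; PhantomJS must be installed
and on your PATH):

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://example.com/quotes"          # placeholder URL

driver = webdriver.PhantomJS()             # headless: no browser window
driver.get(url)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(len(soup.find_all("td")))            # how many cells made it into the DOM

driver.quit()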

Perhaps the best option might be Ghost.py, which sounds like it might be
exactly what you need, but I have no experience with it.

So, I'm afraid that achieving what you want will require a rather more
complicated solution than what you've currently got.  :(

Nevertheless, here are some links for you:

Ghost.py:
http://jeanphix.me/Ghost.py/
http://ghost-py.readthedocs.io/en/latest/#

PhantomJS:
http://phantomjs.org/

PhantomJS & Python:
http://stackoverflow.com/questions/13287490/is-there-a-way-to-use-phantomjs-in-python
http://toddhayton.com/2015/02/03/scraping-with-python-selenium-and-phantomjs/

SlimerJS:
http://docs.slimerjs.org/0.9/quick-start.html

While answering this question I also stumbled across the following page,
which (supposedly) lists almost every headless browser or framework in
existence:
https://github.com/dhamaniasad/HeadlessBrowsers

I see there are a couple of other possible options on there, but I'll
leave it to you to investigate.

Good luck,

Walter
