[Tutor] Unable to download <th>, <td> using Beautifulsoup

bruce badouglas at gmail.com
Fri Jul 29 18:10:04 EDT 2016


Following up on what Walter said:

If a browser with cookies and JavaScript disabled doesn't show the
content, a plain HTTP fetch won't return it either, and you need a
different approach.
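
One rough way to check this from Python, rather than by switching things
off in the browser, is to fetch the page as-is and see whether a value you
can see in 'Inspect' is present in the raw response. This is only a sketch
using the standard library; the function names are mine, not from any
library.

```python
from urllib.request import urlopen


def fetch_raw_html(url, timeout=10):
    """Fetch a page without running any JavaScript, as a script sees it."""
    with urlopen(url, timeout=timeout) as response:
        # errors="replace" keeps ASCII markers searchable even if the page
        # uses a different encoding than UTF-8.
        return response.read().decode("utf-8", errors="replace")


def appears_in_raw_html(raw_html, expected_text):
    """True if text seen in the browser is already in the server's HTML."""
    return expected_text in raw_html


# Example (needs network access):
# raw = fetch_raw_html("http://data.tsci.com.cn/stock/00939/STK_Broker.htm")
# print(appears_in_raw_html(raw, "Optiver"))
```

If the check comes back False for data you can see in the browser, the
page is most likely filling it in with JavaScript after it loads.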

The most "complete" approach is to use a headless browser. However,
using a headless browser has its own share of issues: speed,
complexity, etc.

A potentially more useful method is to look at the traffic (with
Live HTTP Headers for Firefox, for example) to get a feel for exactly what
the browser sends and receives. At the same time, look at the JavaScript
functions the page uses to make those requests.

I've found it's often enough to craft the requisite cookies and requests
(with curl, for example) in order to simulate what the browser sends.
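
As a rough Python sketch of that idea (standard library only), you would
copy the headers and cookies you saw in the captured traffic into your own
request. The header values, cookie names and the data URL below are
placeholders I made up; the real ones have to come from what the browser
actually sends.

```python
from urllib.request import Request, urlopen


def browser_like_headers(referer, cookies):
    """Build headers resembling a browser's background (XHR) request."""
    headers = {
        "User-Agent": "Mozilla/5.0",           # placeholder browser string
        "Referer": referer,                    # page that made the request
        "X-Requested-With": "XMLHttpRequest",  # commonly sent by JS libraries
    }
    if cookies:
        # Cookies travel as a single "name=value; name=value" header.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return headers


def fetch(url, headers, timeout=10):
    """Send a GET request with the given headers and return the raw body."""
    request = Request(url, headers=headers)
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Example (hypothetical data URL and cookie; take the real ones from the
# captured traffic):
# headers = browser_like_headers(
#     "http://data.tsci.com.cn/stock/00939/STK_Broker.htm",
#     {"sessionid": "value-from-browser"},
# )
# body = fetch("http://example.com/path/seen/in/traffic", headers)
```

The same headers dict could be passed to requests.get(url, headers=...)
if you prefer to stay with requests.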

In a few cases, though, I've run across situations where a headless browser
is the only real solution.
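
For what it's worth, a headless-browser version might look something like
the sketch below. It assumes Selenium and a Chrome driver are installed,
and that the <th>/<td> cells end up under the "BuyingSeats" div once the
scripts have run; I haven't verified either against the site. The cell
extraction uses the standard library's HTMLParser so it stands alone, but
BeautifulSoup would work just as well on the rendered HTML.

```python
from html.parser import HTMLParser


def render_page(url, wait_selector, timeout=10):
    """Load a page in headless Chrome and return the HTML after JS has run."""
    # Imported here so the parsing code below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the scripts have inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


class CellCollector(HTMLParser):
    """Collect the stripped text of every <th> and <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._tag = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.cells.append((tag, "".join(self._text).strip()))
            self._tag = None


def extract_cells(html):
    """Return a list of (tag, text) pairs for the table cells in html."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


# Example (needs Selenium, Chrome and network access; the selector is a guess):
# html = render_page("http://data.tsci.com.cn/stock/00939/STK_Broker.htm",
#                    "#BuyingSeats th")
# print(extract_cells(html))
```

This is the slow and heavy route mentioned above, so it is usually worth
trying the cookie/request approach first.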



On Fri, Jul 29, 2016 at 3:28 AM, Crusier <crusier at gmail.com> wrote:

> I am using Python 3 on Windows 7.
>
> However, I am unable to download some of the data listed on the
> following web site:
>
> http://data.tsci.com.cn/stock/00939/STK_Broker.htm
>
> 1453.IMC      98.28M  18.44M  4.32  5.33
> 1499.Optiver  70.91M  13.29M  3.12  5.34
> 7387.花旗环球  52.72M   9.84M  2.32  5.36
>
> When I use Google Chrome and use 'View Page Source', the data does not
> show up at all. However, when I use 'Inspect', I am able to read the
> data.
>
> '<th>1453.IMC</th>'
> '<td>98.28M</td>'
> '<td>18.44M</td>'
> '<td>4.32</td>'
> '<td>5.33</td>'
>
> '<th>1499.Optiver </th>'
> '<td> 70.91M</td>'
> '<td>13.29M </td>'
> '<td>3.12</td>'
> '<td>5.34</td>'
>
> Please kindly explain to me whether the data is hidden in a CSS style
> sheet, or whether there is any way to retrieve the data listed.
>
> Thank you
>
> Regards, Crusier
>
> from bs4 import BeautifulSoup
> import requests
>
> stock_codes = ('00939', '0001')
>
> def web_scraper(stock_codes):
>     broker_url = 'http://data.tsci.com.cn/stock/'
>     end_url = '/STK_Broker.htm'
>
>     for code in stock_codes:
>         # Build the broker page URL for this stock code and fetch it.
>         new_url = broker_url + code + end_url
>         response = requests.get(new_url)
>         soup = BeautifulSoup(response.content, "html.parser")
>
>         # Containers for the buying and selling broker data.
>         buy_list = soup.find_all('div', id="BuyingSeats")
>         sell_list = soup.find_all('div', id="SellSeats")
>
>         print(buy_list)
>         print(sell_list)
>
> web_scraper(stock_codes)