[Tutor] Web Page Scraping
Walter Prins
wprins at gmail.com
Tue May 24 10:37:08 EDT 2016
Hi,
On 24 May 2016 at 04:17, Crusier <crusier at gmail.com> wrote:
>
> Dear All,
>
> I am trying to scrape a web site using Beautiful Soup. However, BS
> doesn't show any of the data. I am just wondering if it is Javascript
> or some other feature which hides all the data.
>
> I have the following questions:
>
> 1) Please advise how to scrape the following data from the website:
>
> 'http://www.dbpower.com.hk/en/quote/quote-warrant/code/10348'
>
> Type, Listing Date (Y-M-D), Call / Put, Last Trading Day (Y-M-D),
> Strike Price, Maturity Date (Y-M-D), Effective Gearing (X),Time to
> Maturity (D),
> Delta (%), Daily Theta (%), Board Lot.......
>
> 2) I am able to scrape most of the data from the same site
>
> 'http://www.dbpower.com.hk/en/quote/quote-cbbc/code/63852'
>
> Please advise what is the difference between these two sites.
You didn't state which version of Python you're using, nor which
operating system, but your source contains print calls with
parentheses, so I assume some version of Python 3, and I'm going to
guess you're on Windows. Be that as it may, your program crashes under
both Python 2 and Python 3. The str() conversion is flagged as a
problem by Python 2, stating:
"Traceback (most recent call last):
File "test.py", line 30, in <module>
web_scraper(warrants)
File "test.py", line 25, in web_scraper
name1 = str(n.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 282: ordinal not in range(128)"
Meanwhile, Python 3 breaks earlier with the message:
"Traceback (most recent call last):
File "test.py", line 30, in <module>
web_scraper(warrants)
File "test.py", line 18, in web_scraper
print(soup)
File "C:\Python35-32\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in
position 435-439: character maps to <undefined>"
Both of these tracebacks are telling you that you have an encoding
issue. Aside from that, your program seems to work, and the data you
say you want to retrieve is in fact returned.
So in short: if you avoid implicitly encoding the Unicode result from
Beautiful Soup into ASCII (or the local machine's code page), which is
what happens with your unqualified print calls, you should avoid the
problem.
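To make the failure concrete, here's a small Python 3 sketch of my own
(the sample string is made up; the u'\xa0' non-breaking space is the
character named in your Python 2 traceback):

    text = u'Strike Price:\xa0123'   # hypothetical value containing a non-breaking space

    try:
        # On Python 2, str(u_string) implicitly attempts this ASCII encode.
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        print('Cannot encode to ASCII:', exc)

    # Encoding explicitly, with an error handler, never raises:
    print(text.encode('ascii', 'backslashreplace'))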
But I guess you're going to want to keep using print, so you may want
to know what the issue actually is and how you might avoid it.

So: the reason for the problem is (basically, as I understand it) that
on Windows your console (which is where the output of your print calls
goes) is not Unicode aware. This means that when you ask Python to
print a Unicode string to the console, there must first be a
conversion from Unicode to something your console can accept before
the print can execute. On Python 2, if you don't deal with this
explicitly, "ascii" is used, which then duly falls over as soon as it
runs into anything that doesn't map cleanly onto the ASCII character
set. Python 3 is clever enough to figure out what my console code page
(cp850) is, which means more characters are mappable to my console's
character set; however, that is still not enough to convert characters
435-439 encountered in the BeautifulSoup result, as mentioned in the
error message.
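You can check what encoding your own console reports (and hence what
Python 3 will try to encode to when printing) like this; the values in
the comment are just examples from my setup:

    import sys

    print(sys.stdout.encoding)   # e.g. 'cp850' in my Windows console, often 'utf-8' elsewhere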
The way to avoid this is to tell Python explicitly how to handle the
conversion. For example (changed lines are marked with ****):
from bs4 import BeautifulSoup
import requests
import json
import re
import sys  # ****

warrants = ['10348']

def web_scraper(warrants):
    url = "http://www.dbpower.com.hk/en/quote/quote-warrant/code/"

    # Scrape from the Web
    for code in warrants:
        new_url = url + code
        response = requests.get(new_url)
        html = response.content
        soup = BeautifulSoup(html, "html.parser")
        print(soup.encode(sys.stdout.encoding, "backslashreplace"))  # ****

        name = soup.findAll('div', attrs={'class': 'article_content'})
        # print(name)
        for n in name:
            name1 = n.text  # ****
            s_code = name1[:4]
            print(name1.encode(sys.stdout.encoding, "backslashreplace"))  # ****

web_scraper(warrants)
Here I'm picking up the encoding from stdout, which on my machine is
"cp850". If sys.stdout.encoding is blank (None) on your machine, you
might try something explicit, or as a last resort try "utf-8"; that
should at least make the text "printable" (though the output may not
be exactly what you want).
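If you want to package that fallback up, something along these lines
should work; safe_print and the sample string are just illustrative
names I've made up, not part of your script:

    import sys

    def safe_print(text):
        # Use the console encoding when Python knows it, otherwise fall
        # back to UTF-8, and replace anything unmappable with backslash
        # escapes so the print never raises UnicodeEncodeError.
        enc = sys.stdout.encoding or 'utf-8'
        print(text.encode(enc, 'backslashreplace'))

    safe_print(u'Effective Gearing (X)\xa05.2')   # made-up sample with a non-breaking space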
I hope that helps (and I look forward to possible corrections or
improved advice from other list members, as I'm admittedly not an
expert on Unicode handling either).
For reference: in future, always post the full error message and your
Python version and operating system.
Cheers
Walter