Error getting data from website
Peter Otten
__peter__ at web.de
Sat Dec 7 05:53:59 EST 2019
Michael Torrie wrote:
> On 12/6/19 5:31 PM, DL Neil via Python-list wrote:
>> If you read the HTML data that the REPL has happily splattered all over
>> your terminal's screen (scroll back) (NB "soup" is easier to read than
>> is "content"!) you will observe that what you saw in your web-browser is
>> not what Amazon served in response to the Python "requests.get()"!
>
> Sadly it's likely that Amazon's page is largely built from javascript.
That's not the problem here. Quoting the html returned by
requests.get("https://www.amazon.ca/dp/B07RZFQ6HC")
"""
To discuss automated access to Amazon data please contact
api-services-support at amazon.com.
"""
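A script can notice this situation itself instead of silently finding no price. The following is only a minimal sketch: the marker string is copied from the response quoted above, the function name looks_blocked is made up for illustration, and Amazon may well change the wording at any time.

```python
# Phrase taken from the response quoted above; an assumption, not a stable API.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(html):
    """Return True if the page text contains the automated-access notice."""
    # Collapse runs of whitespace so a line break inside the phrase does not hide it.
    normalized = " ".join(html.split())
    return BLOCK_MARKER.lower() in normalized.lower()
```

You would call it on the text of the response (for example on requests.get(url).text) before handing the page to BeautifulSoup.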
If you retrieve the page manually:
$ wget "https://www.amazon.ca/dp/B07RZFQ6HC" -O tmp.gz
[...]
2019-12-07 11:47:03 (80,6 KB/s) - »tmp.gz« saved [115426]
$ gunzip tmp.gz
$ python3
[...]
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("tmp").read())
>>> soup.find("span", dict(id="priceblock_dealprice"))
<span class="a-size-medium a-color-price priceBlockDealPriceString"
id="priceblock_dealprice">CDN$ 1,019.00</span>
>>> _.text
'CDN$\xa01,019.00'
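Note the '\xa0' in the text: it is a non-breaking space between the currency and the amount. If you want a number rather than a string, something along these lines might do. It is a sketch that assumes the format shown above (comma as thousands separator, dot as decimal point); parse_price is a hypothetical helper, not part of BeautifulSoup.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Split a price string into (currency, Decimal amount)."""
    # For str patterns \s also matches the non-breaking space '\xa0'.
    # Assumes "1,019.00"-style numbers; other locales need different handling.
    match = re.fullmatch(r"\s*(\D*?)\s*([\d,]+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised price: {text!r}")
    currency, number = match.groups()
    return currency, Decimal(number.replace(",", ""))
```

With the value from the session above you would call parse_price('CDN$\xa01,019.00'). Decimal is used instead of float to avoid rounding surprises with money.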
> So scraping static html is probably not going to get you where you want
> to go.
... because Amazon doesn't like what you do. You can either cheat, or play by
their rules and use the API.
> There are heavier tools, such as Selenium that uses a real
> browser to grab a page, and the result of that you can parse and search
> perhaps.