[Tutor] Trip Advisor Web Scraping

Wed Feb 22 04:39:21 EST 2017

On 22/02/17 02:55, Francis Pino wrote:
> I need to recode my hotel ratings as 1-3  = Negative and  4-5 Positive. Can
> you help point me in the direction to do this? I know I need to make a loop
> using for and in and may statement like for rating in review if review >= 3
>  print ('Negative') else print 'Negative'.  Here's my code so far.

The short answer is that it looks like:

for rating in ratings:
    if rating >= 4:
        print('Positive')
    else:
        print('Negative')
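
As a slightly fuller sketch, the mapping can live in a small function so it
can be reused and tested. It uses the split from the question (1-3 Negative,
4-5 Positive). The name recode_rating and the sample list are made up for
illustration, and the ratings are assumed to already be numbers from 1 to 5:

```python
def recode_rating(rating):
    """Map a 1-5 rating to a label: 4-5 is Positive, 1-3 is Negative."""
    if rating >= 4:
        return 'Positive'
    return 'Negative'


# Hypothetical sample data; real ratings would come from the scraped page.
sample_ratings = [5, 3, 4, 1, 2]
labels = [recode_rating(r) for r in sample_ratings]
print(labels)
```

Returning the label rather than printing it makes it easy to store the
result alongside each review later, for example when writing a CSV file.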

but...

Your code looks much more sophisticated than your question
would suggest, so you should have been able to figure that
much out for yourself. Did you write the code below yourself?
Do you understand how it works so far? Indeed, does it work so far?
What output are you seeing?

Also, are you sure TripAdvisor allows web scraping? If not,
you could find yourself blacklisted from the site. They
may also have a web service API for extracting stats that is
both easier and more reliable to use.
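
One partial check that can be automated is the site's robots.txt file,
which Python's standard urllib.robotparser module can read. The sketch
below parses a made-up robots.txt (the rules and the example.com URLs are
invented for illustration, they are not TripAdvisor's) and asks whether a
URL may be fetched. Note that robots.txt is not the same thing as the
site's terms of use, so it does not settle the question on its own:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Invented robots.txt content, used here so the example needs no network.
sample_robots = [
    "User-agent: *",
    "Disallow: /private/",
]

print(allowed_by_robots(sample_robots, "https://example.com/private/page.html"))
print(allowed_by_robots(sample_robots, "https://example.com/public/page.html"))
```

For a live site you would point the parser at the real file with
set_url() and read() instead of passing lines to parse().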

> from bs4 import BeautifulSoup
> from selenium import webdriver
> import csv
> 
> url = "https://www.tripadvisor.com/Hotel_Review-g34678-d87695-Reviews-Clarion_Hotel_Conference_Center-Tampa_Florida.html"
> # driver = webdriver.Firefox()
> driver = webdriver.Chrome()
> driver.get(url)
> 
> # The HTML code for the web page is stored in the html variable
> html = driver.page_source
> 
> # we will use the soup object to parse HTML
> # BeautifulSoup reference
> # https://www.crummy.com/software/BeautifulSoup/bs4/doc/
> 
> soup = BeautifulSoup(html, "lxml")
> 
> 
> # we will use find_all method to find all paragraph tags of class
> # partial_entry
> # The result of this command is a list
> # we use for .. in [] to iterate through the list
> 
> reviews = []
> ratings= []
> 
> for review in soup.find_all("p", "partial_entry"):
>     print(review.text)
>     #print ('\n\n')
>     reviews.append(review.text)  # store the text; += would add the tag's child nodes
>     print(len(reviews))
> 
> # similarly we can identify the ratings
> # note that the code is incomplete - it will require additional work
> 
> 
> for rating in soup.find_all("span", "ui_bubble_rating"):
>     print(rating.text)
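
If rating.text comes out empty, one possibility is that the score is
carried in the span's class attribute rather than in its text. The sketch
below assumes, and this is only a guess about the page markup, which may
differ or change, that a class name such as bubble_40 stands for a rating
of 4.0. The function name rating_from_classes is made up for illustration:

```python
import re


def rating_from_classes(class_names):
    """Return a rating such as 4.0 from a class like 'bubble_40', else None."""
    for name in class_names:
        match = re.fullmatch(r'bubble_(\d+)', name)
        if match:
            # Assumed encoding: the digits are the rating times ten.
            return int(match.group(1)) / 10
    return None


# In BeautifulSoup, tag.get("class") returns a list of class names like these.
print(rating_from_classes(['ui_bubble_rating', 'bubble_40']))
print(rating_from_classes(['ui_bubble_rating']))
```

The numbers this returns could then be appended to the ratings list and
recoded as Positive or Negative.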

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos