[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

Stefan Behnel stefan_ml at behnel.de
Tue Jul 7 07:35:07 CEST 2009


Hi,

David Kim wrote:
> I have two questions I'm hoping someone will have the patience to
> answer as an act of mercy.
> 
> I. How to get past a Terms of Service page?
> 
> I've just started learning python (have never done any programming
> prior) and am trying to figure out how to open or download a website
> to scrape data. The only problem is, whenever I try to open the link
> (via urllib2, for example) I'm after, I end up getting the HTML to a
> Terms of Service Page (where one has to click an "I Agree" button)
> rather than the actual target page.

One comment to make here is that you should first read that page and check
whether the provider of the service actually allows you to download content
automatically, or to use the service in the way you want. This is entirely
up to them, and if their terms of service state that you must not do that,
well, then you must not do that.

Once you know that it's permitted, you can read the ToS page and look for
the form that the "Agree" button submits. The URL in the form's "action"
attribute is the one you have to request next, augmented with the
parameters ("?xyz=...") that the button sends.


> I've seen examples on the web on providing data for forms (typically
> by finding the name of the form and providing some sort of dictionary
> to fill in the form fields), but this simple act of getting past "I
> Agree" is stumping me. Can anyone save my sanity? As a workaround,
> I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
> in a txt file for later processing. I have no idea why curl works and
> urllib2, for example, doesn't (I use OS X).

There may be different reasons for that. One is that web servers often
present different content depending on how the client identifies itself
(the User-Agent header). So if you see one page with one client and another
page with a different client, that may be the reason.
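
If it is the client identification, you can make urllib2 send a
browser-like User-Agent header yourself. A minimal sketch (the header value
is just an example):

    import urllib2

    url = 'http://www.example.com/page.html'   # the page you are after

    # Identify as a regular browser; any browser-like string will do.
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(request).read()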


> Here's the code (tho it's probably not that illuminating since it's
> basically just opening a url):
> 
> import urllib2
> url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
> #the first of 23 tables
> html = urllib2.urlopen(url).read()

Hmmm, if what you want is to read a stock ticker or something like that,
you should *really* read their ToS first and make sure they do not disallow
automated access. Because it's actually quite likely that they do.
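
Apart from reading the terms themselves, you can also check the site's
robots.txt with the stdlib -- it does not replace the ToS, but it is a
quick additional signal of what automated clients are allowed to fetch:

    import robotparser

    url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'

    rp = robotparser.RobotFileParser()
    rp.set_url('http://www.dtcc.com/robots.txt')
    rp.read()
    print rp.can_fetch('*', url)   # True if fetching this URL is allowed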


> II. How to parse html tables with lxml, beautifulsoup? (for dummies)
> 
> Assuming i get past the Terms of Service, I'm a bit overwhelmed by the
> need to know XPath, CSS, XML, DOM, etc. to scrape data from the web.

Using CSS selectors (lxml.cssselect) is not at all hard. You basically
express the page structure in a *very* short and straightforward way.

Searching the web for a CSS selectors tutorial should give you a few hits.
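
Just to give an idea, selector-based extraction can look roughly like this
(the selector is generic here -- you would narrow it down, e.g. with a
class name, once you know the page structure):

    from lxml import html

    doc = html.parse('test.html')   # the file you downloaded via curl
    # 'table tr td' selects every cell of every table in the page.
    for cell in doc.getroot().cssselect('table tr td'):
        print cell.text_content()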


> The basic tutorials show something like the following:
> 
> from lxml import html
> doc = html.parse("/path/to/test.txt") #the file i downloaded via curl

... or read from the standard output pipe of curl. Note that there is a
stdlib module called "subprocess", which may make running curl easier.
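
An untested sketch of that approach, assuming curl is on your path:

    import subprocess
    from lxml import html

    url = 'http://www.example.com/page.html'   # the page you are after

    # Let curl fetch the page and parse its standard output directly,
    # without going through a temporary file.
    curl = subprocess.Popen(['curl', '-s', url], stdout=subprocess.PIPE)
    doc = html.parse(curl.stdout)
    curl.wait()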

Once you've determined the final URL to parse, you can also push it right
into lxml's parse() function, instead of going through urllib2 or an
external tool. Example:

    url = "http://pypi.python.org/pypi?%3Aaction=search&term=lxml"
    doc = html.parse(url)


> root = doc.getroot() #what is this root business?

The root (or top-most) node of the document you just parsed. Usually an
"html" tag in HTML pages.


> tables = root.cssselect('table')

Simple, isn't it? :)

BTW, did you look at this?

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/


> I understand that selecting all the table tags will somehow target
> however many tables on the page. The problem is the table has multiple
> headers, empty cells, etc. Most of the examples on the web have to do
> with scraping the web for search results or something that don't
> really depend on the table format for anything other than layout.

That's because in cases like yours, you have to do most of the work
yourself anyway. No two pages are alike, so you have to find your way
through the structure and figure out fixed points that let you get at
the data.
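
As a starting point, walking over the rows and cells of each table and
skipping the parts you don't need usually looks something like this (the
"skip empty rows" test is a guess -- adapt it to the actual tables):

    from lxml import html

    doc = html.parse('test.html')   # or the URL, as above
    for table in doc.getroot().cssselect('table'):
        for row in table.cssselect('tr'):
            cells = [cell.text_content().strip()
                     for cell in row.cssselect('td')]
            # header rows often use <th> instead of <td> and come out
            # empty here, as do spacer rows
            if not any(cells):
                continue
            print cells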

Stefan


