[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

David Kim davidkim05 at gmail.com
Mon Jul 6 23:54:58 CEST 2009


Hello all,

I have two questions I'm hoping someone will have the patience to
answer as an act of mercy.

I. How to get past a Terms of Service page?

I've just started learning Python (I've never done any programming
before) and am trying to figure out how to open or download a website
so I can scrape data from it. The only problem is that whenever I try
to open the link I'm after (via urllib2, for example), I end up getting
the HTML of a Terms of Service page (the kind where one has to click an
"I Agree" button) rather than the actual target page.

I've seen examples on the web of providing data for forms (typically
by finding the name of the form and supplying some sort of dictionary
to fill in its fields), but this simple act of getting past "I Agree"
is stumping me. Can anyone save my sanity? As a workaround, I've been
using os.popen('curl ' + url + ' > ' + filename) to save the HTML in a
text file for later processing. I have no idea why curl works and
urllib2, for example, doesn't (I use OS X). I even tried Yahoo Pipes
to sidestep coding anything altogether, but ended up looking at the
same Terms of Service page anyway.

Here's the code (though it's probably not that illuminating, since it's
basically just opening a URL):

import urllib2
url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
#the first of 23 tables
html = urllib2.urlopen(url).read()
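
Since curl succeeds where urllib2 fails, my guess (and it is only a
guess) is that the site keys off cookies set when "I Agree" is clicked,
and possibly off the User-Agent header. Here is a minimal sketch of
handling both with cookielib; the form action URL and field name are
placeholders I invented, so check the ToS page's HTML source for the
real ones:

import urllib
import urllib2
import cookielib

# keep cookies across requests so the server can remember the agreement
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
# some sites serve different content to Python's default User-Agent
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'

# the first request lands on the Terms of Service page (and sets cookies)
tos_html = opener.open(url).read()

# submit the "I Agree" form; BOTH the action URL and the field name below
# are hypothetical -- read the <form> tag in tos_html for the real ones
agree_url = 'http://www.dtcc.com/products/derivserv/agree.php'  # hypothetical
form_data = urllib.urlencode({'agree': 'I Agree'})              # hypothetical
opener.open(agree_url, form_data)

# retry the target page; the cookie jar carries the agreement along
html = opener.open(url).read()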

II. How to parse HTML tables with lxml or BeautifulSoup? (for dummies)

Assuming I get past the Terms of Service, I'm a bit overwhelmed by the
need to know XPath, CSS, XML, DOM, etc. to scrape data from the web.
I've tried looking at the documentation included with different Python
libraries, but that just left me more confused.

The basic tutorials show something like the following:

from lxml import html
doc = html.parse("/path/to/test.txt") #the file I downloaded via curl
root = doc.getroot() #what is this root business?
tables = root.cssselect('table')

I understand that selecting all the table tags will somehow match
however many tables are on the page. The problem is that the table has
multiple headers, empty cells, etc. Most of the examples on the web
deal with scraping search results or other pages that use tables only
for layout, not for structured data. Are there any resources out there,
appropriate for web/Python illiterati like myself, that deal with
structured data like the tables at the URL above?
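
For what it's worth, here is my best sketch so far of what the
row-and-cell walk seems to look like with lxml (this assumes lxml's
cssselect support is available). Header rows and empty cells come
through as ordinary rows and empty strings, so presumably they can be
filtered afterwards:

from lxml import html

doc = html.parse("/path/to/test.txt")   # the file saved via curl
root = doc.getroot()                    # the top-level <html> element

for table in root.cssselect('table'):
    for row in table.cssselect('tr'):
        # th and td both count as cells; text_content() flattens any
        # markup nested inside a cell down to plain text
        cells = [cell.text_content().strip()
                 for cell in row.cssselect('th, td')]
        if any(cells):      # skip rows that are entirely empty
            print cells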

FYI, the data at the URL above goes up in smoke every week, so I'm
trying to capture it automatically on a weekly basis. Getting all of it
into a CSV file or a database would be a personal cause for celebration,
as it would be the first really useful thing I've done with Python since
I started learning it a few months ago.
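
If the row walk above pans out, it looks like the standard-library csv
module would cover the last step; a sketch, assuming each row arrives
as a list of strings:

import csv

def rows_to_csv(rows, filename):
    # the csv module wants a binary-mode file and byte strings on Python 2
    out = open(filename, 'wb')
    writer = csv.writer(out)
    for cells in rows:
        writer.writerow([c.encode('utf-8') for c in cells])
    out.close()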

For anyone who is interested, here is the code that uses curl to pull
the web pages. It basically just builds the URL string for each of the
table pages and saves the file under a timestamped filename:

import os
from time import strftime

BASE_URL = 'http://www.dtcc.com/products/derivserv/data_table_'
SECTIONS = {'section1': {'select': 'i.php?id=table', 'id': range(1, 9)},
            'section2': {'select': 'ii.php?id=table', 'id': range(9, 17)},
            'section3': {'select': 'iii.php?id=table', 'id': range(17, 24)},
            }

def get_pages():
    """Fetch every table page with curl, saving each under a dated name."""
    filenames = []
    # expanduser makes the '~' path work even if the shell doesn't expand it
    path = os.path.expanduser('~/Dev/Data/DTCC_DerivServ/')

    for section in SECTIONS:
        # named table_id to avoid shadowing the builtin id()
        for table_id in SECTIONS[section]['id']:
            url = BASE_URL + SECTIONS[section]['select'] + str(table_id)
            timestamp = strftime('%Y%m%d_')
            section_number = SECTIONS[section]['select'].split('.')[0]
            filename = timestamp + str(table_id) + '_' + section_number + '.txt'
            # e.g. 20090706_1_i.txt
            os.popen('curl ' + url + ' > ' + path + filename)
            filenames.append(filename)

    return filenames

if __name__ == '__main__':
    get_pages()
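
As an aside, I gather that subprocess (in the standard library since
Python 2.4) can do the same job as os.popen without going through the
shell, which avoids quoting surprises if a URL or path ever contains
odd characters; a sketch using curl's -s (silent) and -o (output file)
flags:

import subprocess

def fetch(url, outpath):
    # each argument goes to curl directly; no shell, no quoting needed
    subprocess.call(['curl', '-s', '-o', outpath, url])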


--
morenotestoself.wordpress.com

