Output of HTML parsing

Fri Jun 15 14:01:06 EDT 2007

Jackie wrote:
> I want to get the information of the professors (name,title) from the
> following link:
> 
> "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

That page is even XHTML, so there is no need to go through BeautifulSoup. Use lxml instead.

http://codespeak.net/lxml

> Ideally, I'd like to have a output file where each line is one Prof,
> including his name and title. In practice, I use the CSV module.
> ----------------------------------------------------
> 
> import urllib,re,csv
> 
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
> 
> sock = urllib.urlopen(url)
> htmlSource = sock.read()
> sock.close()

import lxml.etree as et
url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
tree = et.parse(url)

> namePattern = re.compile(r'class="name">(.*)</a>')
> titlePattern = re.compile(r'</a>, (.*)\s*</td>')
> 
> name = namePattern.findall(htmlSource)
> title_temp = titlePattern.findall(htmlSource)
> title =[]
> for item in title_temp:
>     item_new=" ".join(item.split())                #Suppress the spaces between 'title' and </td>
>     title.extend([item_new])
> 
> 
> output =[]
> for i in range(len(name)):
>     output.insert(i,[name[i],title[i]])            #Generate a list of [name, title]

# untested
get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
name_list = []
for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
  name_list.append(
    tuple(get_name_text(name_row).split(",", 3) + ["","",""])[:3] )
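
The pad-and-slice idiom in that last line can be tried on its own, without lxml. This is only a minimal sketch: the helper name split_fields and the sample strings are made up for illustration and are not taken from the real page. Unlike the code above, it splits at most count - 1 times (so any further commas stay in the last field) and strips whitespace around each field; both are my assumptions about what you want.

```python
def split_fields(text, count=3):
    """Split text on commas and return exactly `count` stripped fields."""
    # Split at most count - 1 times so extra commas stay in the last field.
    parts = [part.strip() for part in text.split(",", count - 1)]
    # Pad with empty strings so short rows still yield `count` fields,
    # then cut the padded list back down to `count` entries.
    return tuple(parts + [""] * count)[:count]


# Hypothetical sample strings, not real data from the faculty page.
print(split_fields("Jane Doe, Professor"))
print(split_fields("John Roe, Associate Professor, Graduate Chair"))
```

Every row then has the same number of columns, which keeps the CSV output rectangular.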

> writer = csv.writer(open("professor.csv", "wb"))
> writer.writerows(output)                           #output CSV file

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(name_list)                         #output CSV file
> -------------- End of Program ----------------------------------------------
> 
> 3.Should I close the opened csv file("professor.csv")? How to close
> it?

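The csv writer itself has no "close()" method, but the file object you pass to it does. Keep a reference to that file object and call its "close()" once you are done writing.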
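
On closing the file, here is a minimal sketch of one way to do it. It assumes Python 3, where a file for the csv module is opened in text mode with newline="" rather than "wb". The function name write_rows and the sample row are hypothetical. The with statement closes the file for you, even if writing fails part way through.

```python
import csv


def write_rows(path, rows):
    """Write rows to a CSV file and close the file when done."""
    # newline="" lets the csv module control line endings itself (Python 3).
    with open(path, "w", newline="") as csv_file:
        csv.writer(csv_file).writerows(rows)
    # Leaving the with block closes csv_file, even if writerows() raised.


if __name__ == "__main__":
    # Hypothetical row, not real data from the faculty page.
    write_rows("professor.csv", [("Jane Doe", "Professor", "")])
```

If you would rather not use with, the equivalent is to bind the result of open() to a name, pass it to csv.writer(), and call close() on that name after writerows().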

Stefan