Output of html parsing

Jackie Wang jackie_python at yahoo.ca
Fri Jun 15 15:39:30 CEST 2007


Hi, all,
   
  I want to extract information about the professors (name, title) from the following page:
   
  "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
   
  Ideally, I'd like an output file where each line is one professor, with his or her name and title. In practice, I use the csv module to write it.
   
  The following is my program:
  
--------------- Program ----------------------------------------------------
import urllib, re, csv

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)</a>')
titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')
name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title = []
for item in title_temp:
    item_new = " ".join(item.split())        # Suppress the spaces between 'title' and </td>
    title.append(item_new)

output = []
for i in range(len(name)):
    output.insert(i, [name[i], title[i]])    # Generate a list of [name, title]

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output)                     # Output CSV file
-------------- End of Program ----------------------------------------------
   
  My questions are:
   
  1. The code above assumes that each professor has a title. If any one of them does not, the names and titles will be mismatched. How can I write the program so that the title is allowed to be empty?
   
  2. Is there an easier way to get the data I want than building lists?
   
  3. Should I close the opened CSV file ("professor.csv")? How do I close it?
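
  To make questions 1 and 3 concrete, here is a minimal sketch of one possible direction, not a tested solution against the real page. It assumes each professor sits in a single table cell on one line, and that a cell without a title ends right after </a>. The sample HTML, the names in it, and the helpers extract_rows and write_rows are hypothetical. It is written for Python 3, whereas the program above uses the Python 2 urllib and "wb" mode.

```python
import csv
import re

# Hypothetical sample standing in for the downloaded page; the real markup may differ.
SAMPLE_HTML = '''
<td><a href="/p/1" class="name">Alice Smith</a>,&nbsp;Professor   </td>
<td><a href="/p/2" class="name">Bob Jones</a></td>
<td><a href="/p/3" class="name">Carol Lee</a>,&nbsp;Associate  Professor</td>
'''

# One pattern captures the name and, when present, the title from the same cell,
# so a missing title cannot shift later titles onto the wrong names.
ROW_PATTERN = re.compile(r'class="name">(.*?)</a>(?:,&nbsp;(.*?))?\s*</td>')


def extract_rows(html):
    """Return a list of [name, title]; title is '' when the cell has none."""
    rows = []
    # findall yields '' for an optional group that did not take part in the match.
    for name, title in ROW_PATTERN.findall(html):
        rows.append([name.strip(), " ".join(title.split())])
    return rows


def write_rows(rows, path):
    """Write the rows as CSV; the with block closes the file on exit."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


rows = extract_rows(SAMPLE_HTML)
```

  Calling write_rows(rows, "professor.csv") would then write one line per professor, with an empty second field where no title was found. Without a with block, the alternative is to keep the file object in a variable and call its close() method after writerows.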
   
  Thanks!
   
  Jackie
  
 

       