Output of HTML parsing
Stefan Behnel
stefan.behnel-n05pAM at web.de
Fri Jun 15 14:01:06 EDT 2007
Jackie wrote:
> I want to get the information of the professors (name,title) from the
> following link:
>
> "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.
http://codespeak.net/lxml
> Ideally, I'd like to have an output file where each line is one Prof,
> including his name and title. In practice, I use the CSV module.
> ----------------------------------------------------
>
> import urllib,re,csv
>
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
>
> sock = urllib.urlopen(url)
> htmlSource = sock.read()
> sock.close()
import lxml.etree as et
url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
tree = et.parse(url)
> namePattern = re.compile(r'class="name">(.*)</a>')
> titlePattern = re.compile(r'</a>, (.*)\s*</td>')
>
> name = namePattern.findall(htmlSource)
> title_temp = titlePattern.findall(htmlSource)
> title =[]
> for item in title_temp:
>     item_new=" ".join(item.split())  #Suppress the spaces between 'title' and </td>
>     title.extend([item_new])
>
>
> output =[]
> for i in range(len(name)):
>     output.insert(i,[name[i],title[i]])  #Generate a list of [name, title]
# untested
get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
name_list = []
for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
    name_list.append(
        tuple(get_name_text(name_row).split(",", 3) + ["","",""])[:3] )
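In case lxml is not installed, here is a rough, self-contained sketch of the same row extraction using the standard library's xml.etree.ElementTree instead (a swap-in, not the lxml calls above). The sample markup, the names in it, and the helper extract_rows are made up for illustration, and the real page may well be structured differently. It assumes Python 3.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the layout of the real faculty page is an assumption.
SAMPLE = """<html><body><table>
<tr><td><a class="name" href="#">Jane Doe</a>, Professor
   </td></tr>
<tr><td><a class="name" href="#">Richard Roe</a></td></tr>
<tr><td>no link in this row</td></tr>
</table></body></html>"""


def extract_rows(xhtml_text):
    """Return one 3-tuple of comma-separated fields per row with a name link."""
    root = ET.fromstring(xhtml_text)
    rows = []
    for tr in root.iter("tr"):
        for td in tr.findall("td"):
            # Only cells that directly contain <a class="name"> are of interest.
            if td.find("a[@class='name']") is None:
                continue
            # Collapse runs of whitespace, similar to XPath normalize-space().
            text = " ".join("".join(td.itertext()).split())
            parts = [p.strip() for p in text.split(",", 3)]
            # Pad with empty strings so every row has exactly three fields.
            rows.append(tuple((parts + ["", "", ""])[:3]))
            break  # one matching cell per row is enough
    return rows


rows = extract_rows(SAMPLE)
```

The padding with empty strings followed by the slice keeps every CSV row the same width, even when a cell has no title after the name.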
> writer = csv.writer(open("professor.csv", "wb"))
> writer.writerows(output) #output CSV file
writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(name_list) #output CSV file
> -------------- End of Program ----------------------------------------------
>
> 3.Should I close the opened csv file("professor.csv")? How to close
> it?
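The csv writer itself has no "close()" method. Keep a reference to the file
object you pass to csv.writer() and call close() on that once you are done
writing.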
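A minimal sketch of closing the underlying file rather than the writer, assuming Python 3 (where the csv docs recommend opening the file in text mode with newline=""). The helper name write_rows, the temporary directory, and the sample row are placeholders for illustration only.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to a CSV file and report whether the file got closed."""
    # The "with" block closes the file when it ends, even if writing fails.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return f.closed


# Placeholder location and data, so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "professor.csv")
closed = write_rows(path, [("Jane Doe", "Professor")])
```

Without a "with" block, the equivalent is to bind the result of open() to a name and call close() on it after writerows().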
Stefan