how can i use lxml with win32com?

Michiel Overtoom motoom at xs4all.nl
Sun Oct 25 09:52:55 EDT 2009


elca wrote:

> http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0
> that is a Korean portal site; I searched it with the keyword 'korea times',
> and I want to save the scraped results to a text file named 'blogscrap_save.txt'

Aha, now we're getting somewhere.

Getting and parsing that page is no problem, and doesn't need JavaScript 
or Internet Explorer.

import urllib2
import BeautifulSoup

# Fetch the search-results page and parse it with BeautifulSoup 3.
doc = urllib2.urlopen("http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0")
soup = BeautifulSoup.BeautifulSoup(doc)


By analyzing the structure of that page you can see that the articles 
are presented in an unordered list which has class "type01".  The 
interesting bit in each list item is encapsulated in a <dd> tag with 
class "sh_news_passage".  So, to parse the articles:

# Each interesting passage sits in a <dd class="sh_news_passage"> inside the list.
ul = soup.find("ul", "type01")
for li in ul.findAll("li"):
    dd = li.find("dd", "sh_news_passage")
    print dd.renderContents()
    print

This example prints them, but you could also save them to a file (or a 
database, whatever).
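
For example, here is an (untested) sketch that writes each passage to
'blogscrap_save.txt', the file name from your message. It assumes one
passage per block, separated by blank lines; adjust the separator to taste:

import urllib2
import BeautifulSoup

url = "http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0"
soup = BeautifulSoup.BeautifulSoup(urllib2.urlopen(url))

out = open("blogscrap_save.txt", "w")   # file name taken from your message
ul = soup.find("ul", "type01")
for li in ul.findAll("li"):
    dd = li.find("dd", "sh_news_passage")
    if dd is not None:                  # skip list items without a passage
        # renderContents() returns UTF-8 encoded bytes by default in BeautifulSoup 3
        out.write(dd.renderContents())
        out.write("\n\n")
out.close()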

Greetings,



-- 
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html


