extracting from web pages but got disordered words sometimes
Paul McGuire
ptmcg at austin.rr.com
Sat Jan 27 14:18:24 EST 2007
On Jan 27, 5:18 am, "Frank Potter" <could.... at gmail.com> wrote:
> There are ten web pages I want to deal with.
> from http://www.af.shejis.com/new_lw/html/125926.shtml
> to http://www.af.shejis.com/new_lw/html/125936.shtml
>
> Each of them uses the Chinese charset "gb2312", and Firefox
> displays all of them correctly, as readable Chinese.
>
> My job is: I fetch every page, extract its html title, and
> display the title on a Linux shell terminal.
>
> And my problem is: for some pages I get a human-readable title
> (in Chinese), but for other pages I get disordered words. Since
> every page uses the same charset, I don't know why I can't get
> every title the same way.
>
> Here's my python code, get_title.py :
>
> [CODE]
> #!/usr/bin/python
> import urllib2
> from BeautifulSoup import BeautifulSoup
>
> min_page=125926
> max_page=125936
>
> def make_page_url(page_index):
>     return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",
>                       str(page_index), ur".shtml"])
>
> def get_page_title(page_index):
>     url=make_page_url(page_index)
>     print "now getting: ", url
>     user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>     headers={'User-Agent':user_agent}
>     req=urllib2.Request(url,None,headers)
>     response=urllib2.urlopen(req)
>     #print response.info()
>     page=response.read()
>
>     #extract title with BeautifulSoup
>     soup=BeautifulSoup(page)
>     full_title=str(soup.html.head.title.string)
>
>     #title is in the format "title --title";
>     #strip the "--" and the duplicate title
>     title=full_title[full_title.rfind('-')+1:]
>
>     return title
>
> for i in xrange(min_page,max_page):
>     print get_page_title(i)
> [/CODE]
>
> Will somebody please help me out? Thanks in advance.
This pyparsing solution seems to extract what you were looking for,
but I don't know if this will render to Chinese or not.
-- Paul
from pyparsing import makeHTMLTags, SkipTo
import urllib

titleStart, titleEnd = makeHTMLTags("title")
scanExpr = (titleStart + SkipTo("- -", include=True) +
            SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

def extractTitle(htmlSource):
    titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
    return titleSource.titleChars

for urlIndex in range(125926, 125936+1):
    url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
    pg = urllib.urlopen(url)
    html = pg.read()
    pg.close()
    print url, ':', extractTitle(html)
Gives:
http://www.af.shejis.com/new_lw/html/125926.shtml : GSM±¾µØÍø×éÍø·½Ê½
http://www.af.shejis.com/new_lw/html/125927.shtml : GSM±¾µØÍø×éÍø·½Ê½³õ̽
http://www.af.shejis.com/new_lw/html/125928.shtml : GSMµÄÊý¾ÝÒµÎñ
http://www.af.shejis.com/new_lw/html/125929.shtml : GSMµÄÊý¾ÝÒµÎñºÍ³ÐÔØÄÜÁ¦
http://www.af.shejis.com/new_lw/html/125930.shtml : GSMµÄÍøÂçÑݽø-´ÓGSMµ½GPRSµ½3G £¨¸½Í¼£©
http://www.af.shejis.com/new_lw/html/125931.shtml : GSM¶ÌÏûÏ¢ÒµÎñÔÚË®Çé×Ô¶¯²â±¨ÏµÍ³ÖеÄÓ¦ÓìØ
http://www.af.shejis.com/new_lw/html/125932.shtml : £Ç£Ó£Í½»»»ÏµÍ³µÄÍøÂçÓÅ»¯
http://www.af.shejis.com/new_lw/html/125933.shtml : GSMÇл»µô»°µÄ·ÖÎö¼°½â¾ö°ì·¨
http://www.af.shejis.com/new_lw/html/125934.shtml : GSMÊÖ»ú²¦½ÐÊл°Ä£¿é¾ÖÓû§¹ÊÕϵÄÆÊÎö
http://www.af.shejis.com/new_lw/html/125935.shtml : GSMÊÖ»úµ½WCDMAÖն˵ÄÑݱä
http://www.af.shejis.com/new_lw/html/125936.shtml : GSMÊÖ»úµÄάÐÞ·½·¨
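[Editor's note: a likely explanation for the "disordered words" above is that the titles are GB2312-encoded bytes being printed without decoding, so a Latin-1 terminal shows mojibake. A minimal sketch of the fix, assuming the pages really are GB2312 as declared; the byte string is the first title from the output above, re-typed as the bytes it actually contains:]

```python
# -*- coding: utf-8 -*-
# The extracted title arrives as raw GB2312 bytes; decoding them
# explicitly yields proper Chinese instead of mojibake.
# (Hypothetical example bytes, taken from the first title printed above.)

raw_title = b'GSM\xb1\xbe\xb5\xd8\xcd\xf8\xd7\xe9\xcd\xf8\xb7\xbd\xca\xbd'

# Decode from the page's declared charset to Unicode before printing.
title = raw_title.decode('gb2312')
print(title)  # GSM本地网组网方式
```

[The same decode step applied to `extractTitle`'s return value, before printing, should make every title come out readably, provided the terminal itself can display Chinese.]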