extracting from web pages but got disordered words sometimes

Paul McGuire ptmcg at austin.rr.com
Sat Jan 27 14:18:24 EST 2007


On Jan 27, 5:18 am, "Frank Potter" <could.... at gmail.com> wrote:
> There are ten web pages I want to deal with.
> from http://www.af.shejis.com/new_lw/html/125926.shtml
> to   http://www.af.shejis.com/new_lw/html/125936.shtml
>
> Each of them uses the Chinese charset "gb2312", and Firefox
> displays all of them correctly, as readable Chinese.
>
> My job is to fetch every page, extract its HTML title, and
> display the title in a Linux shell terminal.
>
> My problem is that for some pages I get a human-readable title (in
> Chinese), but for other pages I get garbled text. Since every page
> has the same charset, I don't know why I can't get every title in the
> same way.
>
> Here's my Python code, get_title.py:
>
> [CODE]
> #!/usr/bin/python
> import urllib2
> from BeautifulSoup import BeautifulSoup
>
> min_page=125926
> max_page=125936
>
> def make_page_url(page_index):
>     return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",str(page_index),ur".shtml"])
>
> def get_page_title(page_index):
>     url=make_page_url(page_index)
>     print "now getting: ", url
>     user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>     headers={'User-Agent':user_agent}
>     req=urllib2.Request(url,None,headers)
>     response=urllib2.urlopen(req)
>     #print response.info()
>     page=response.read()
>
>     #extract title with BeautifulSoup
>     soup=BeautifulSoup(page)
>     full_title=str(soup.html.head.title.string)
>
>     #title is in the format of "title --title"
>     #use this code to delete the "--" and the duplicate title
>     title=full_title[full_title.rfind('-')+1::]
>
>     return title
>
> for i in xrange(min_page,max_page):
>     print get_page_title(i)
> [/CODE]
>
> Will somebody please help me out? Thanks in advance.

This pyparsing solution seems to extract what you were looking for, 
but I don't know if this will render to Chinese or not.

-- Paul

from pyparsing import makeHTMLTags,SkipTo
import urllib

titleStart,titleEnd = makeHTMLTags("title")
scanExpr = (titleStart + SkipTo("- -",include=True) +
            SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

def extractTitle(htmlSource):
    titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
    return titleSource.titleChars


for urlIndex in range(125926,125936+1):
    url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
    pg = urllib.urlopen(url)
    html = pg.read()
    pg.close()
    print url,':',extractTitle(html)


Gives:

http://www.af.shejis.com/new_lw/html/125926.shtml : GSM±¾µØÍø×éÍø·½Ê½
http://www.af.shejis.com/new_lw/html/125927.shtml : GSM
±¾µØÍø×éÍø·½Ê½³õ̽
http://www.af.shejis.com/new_lw/html/125928.shtml : GSMµÄÊý¾ÝÒµÎñ
http://www.af.shejis.com/new_lw/html/125929.shtml : 
GSMµÄÊý¾ÝÒµÎñºÍ³ÐÔØÄÜÁ¦
http://www.af.shejis.com/new_lw/html/125930.shtml : GSMµÄÍøÂçÑݽø-
´ÓGSMµ½GPRSµ½3G £¨¸½Í¼£©
http://www.af.shejis.com/new_lw/html/125931.shtml : GSM¶ÌÏûÏ
¢ÒµÎñÔÚË®Çé×Ô¶¯²â±¨ÏµÍ³ÖеÄÓ¦ÓìØ
http://www.af.shejis.com/new_lw/html/125932.shtml : £Ç£Ó
£Í½»»»ÏµÍ³µÄÍøÂçÓÅ»¯
http://www.af.shejis.com/new_lw/html/125933.shtml : GSMÇл»µô»°µÄ·ÖÎö¼
°½â¾ö°ì·¨
http://www.af.shejis.com/new_lw/html/125934.shtml : GSMÊÖ»ú²¦½ÐÊл°Ä
£¿é¾ÖÓû§¹ÊÕϵÄÆÊÎö
http://www.af.shejis.com/new_lw/html/125935.shtml : 
GSMÊÖ»úµ½WCDMAÖն˵ÄÑݱä
http://www.af.shejis.com/new_lw/html/125936.shtml : GSMÊÖ»úµÄάÐÞ·½·¨
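
The titles above look like GB2312 bytes displayed in a single-byte encoding such as Latin-1, which would explain why they do not render as Chinese. A minimal sketch of decoding the title bytes before printing follows. It is written in Python 3 syntax (the code in this thread is Python 2) and uses only the standard library rather than pyparsing or BeautifulSoup. The names extract_title, TITLE_RE, sample_title and sample_page are hypothetical, and the sample title is an assumed hand decoding of the first output line, used only for illustration.

```python
import re

# Match the contents of the first <title> element in raw (undecoded) HTML bytes.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(html_bytes, charset="gb2312"):
    """Return the decoded <title> text, or None if the page has no title."""
    match = TITLE_RE.search(html_bytes)
    if match is None:
        return None
    # Decode only after the bytes have been isolated; undecodable bytes
    # become U+FFFD instead of raising UnicodeDecodeError.
    return match.group(1).decode(charset, errors="replace").strip()

# Hypothetical page built in memory so the sketch needs no network access.
# The title is an assumption, not data fetched from the site.
sample_title = "GSM本地网组网方式"
sample_page = ("<html><head><title>%s</title></head></html>"
               % sample_title).encode("gb2312")

if __name__ == "__main__":
    # The same bytes read as Latin-1: this is the kind of text shown above.
    print(sample_page.decode("latin-1"))
    # The same bytes decoded as GB2312.
    print(extract_title(sample_page))
```

Whether the decoded text then displays correctly still depends on the terminal's own encoding, so a UTF-8 terminal is assumed here.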



