[Tutor] BeautifulSoup: cut first n tags?

Kent Johnson kent_johnson at skillsoft.com
Sat Oct 23 01:40:36 CEST 2004


I would say, if you have something that works, be happy. With website 
scraping you are always at the mercy of the site authors. Your script could 
break in many ways if the site changes. I wouldn't try to guess what those 
are. When it breaks, you can figure out how to fix it.
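
If the body does need to be independent of which tags the author uses, one
option is to walk every direct child of the story <div> in order and keep
whatever follows the byline. The following is only a rough sketch under
stated assumptions: it uses the standard library's html.parser instead of
BeautifulSoup so that it runs without extra packages, it assumes the markup
is well nested (no unclosed <p> tags), and the names StoryBody, VOID_TAGS and
the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never get an end tag. Skipping them also drops <img>.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class StoryBody(HTMLParser):
    """Collect (tag, text) for each direct child of <div class="story">
    that comes after <p class="byline">. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.depth = 0             # 0 = outside the story div, 1 = directly inside it
        self.in_byline = False     # currently inside the byline paragraph
        self.after_byline = False  # the byline paragraph has been closed
        self.blocks = []           # [tag_name, text] pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0:
            # Still looking for the story container.
            if tag == "div" and "story" in classes:
                self.depth = 1
            return
        if self.depth == 1:
            # A direct child of the story div is opening.
            if tag == "p" and "byline" in classes:
                self.in_byline = True
            elif self.after_byline:
                self.blocks.append([tag, ""])
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 1 and self.in_byline:
            self.in_byline = False
            self.after_byline = True

    def handle_data(self, data):
        # Text anywhere inside a collected child goes to that child.
        if self.depth >= 2 and self.after_byline:
            self.blocks[-1][1] += data


# Made-up sample markup, not the real Globe page.
sample = """
<div class="story">
<h1 class="mainHead">Headline</h1>
<p class="byline">By Someone</p>
<p>First paragraph.</p>
<img src="x.gif">
<blockquote>A <b>quoted</b> line.</blockquote>
<ul><li>one</li><li>two</li></ul>
</div>
<p>Footer, outside the story.</p>
"""

parser = StoryBody()
parser.feed(sample)
blocks = parser.blocks
for tag, text in blocks:
    print(tag, repr(text))
```

Because nothing here names <p> except the byline test, a <blockquote>, <dl>
or <ul> in the story is picked up the same way. It will still break if the
class names or the container change, which is the point made above.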

Kent

At 07:57 PM 10/18/2004 -0400, Matej Cepl wrote:
>Hi,
>
>I am quite amazed by the beauty of your BeautifulSoup (it is truly
>beautiful), but still I have one problem which I would like to resolve:
>
>I have a not so bad webpage (Boston Globe story, version for print) on
>http://www.ceplovi.cz/matej/tmp/globe.html and I would like to get some very
>clean stuff from it. It is not a problem to get some interesting information
>from the <div class="story"> element, but I haven't figured out how to get
>the story. Let's see what I have:
>
>  import string
>  from BeautifulSoup import BeautifulSoup
>
>  # Return the stripped first child of the first matching element.
>  def get_content(soup,element,cls):
>   return string.strip(str(soup.first(element,{'class':cls}).contents[0]))
>
>  html = open("globe.html","r").read()
>  soup = BeautifulSoup()
>  soup.feed(html)
>  story = soup.first("div",{'class':'story'})
>  headline = get_content(story,'h1','mainHead')
>  subhead = get_content(story,'h2','subHead')
>  author = get_content(story,'p','byline')
>  date = string.strip(str(story.first('span',\
>   {'style':'white-space: nowrap;'}).contents[0]))
>
>So far, surprisingly easy. But how can I get "all remaining tags (not only
><p>s) in story after (and without) <p> element which is class 'byline' and
>no, I don't need any <img> elements, thanks!"? Is there any way to
>work with ALL the elements as a simple list? I know that I can do
>something like
>
>  body = story.fetch('p')[1:]
>
>but what if some unfortunate author unexpectedly decides that he doesn't
>want to make such ugly soup after all and uses some other tag than <p>
>(<blockquote>,<dl>, or even <ul> comes to mind)?
>
>  Thanks a lot,
>
>   Matej
>
>--
>Matej Cepl, http://www.ceplovi.cz/matej
>GPG Finger: 89EF 4BC6 288A BF43 1BAB  25C3 E09F EF25 D964 84AC
>138 Highland Ave. #10, Somerville, Ma 02143, (617) 623-1488
>
>The function of the expert is not to be more right than other
>people, but to be wrong for more sophisticated reasons.
>     -- Dr. David Butler, British psephologist
>_______________________________________________
>Tutor maillist  -  Tutor at python.org
>http://mail.python.org/mailman/listinfo/tutor


