BeautifulSoup: problems with parsing a website
Stefan Behnel
stefan_ml at behnel.de
Wed May 28 16:04:23 EDT 2008
Marco Hornung wrote:
> Hy guys,
... and girls?
> I'm using the python-framework BeautifulSoup(BS) to parse some
> information out of a german soccer-website.
Consider using lxml.
http://codespeak.net/lxml
>>> from lxml import html
> I want to parse the article shown on the website.
>>> tree = html.parse("http://www.bundesliga.de/de/liga/news/2007/index.php?f=94820.php")
> To do so I want to
> use the Tag " <div class="txt_fliesstext">" as a starting-point.
>>> div = tree.xpath('//div[@class = "txt_fliesstext"]')[0] # xpath() returns a list
> When
> I have found the Tag I somehow want to get all following "br"-Tags
Following? Meaning: after the div?
>>> br_list = div.xpath("following-sibling::br")
Or within the div?
>>> br_list = div.xpath(".//br")
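To make the difference between the two queries concrete, here is a minimal sketch run against a small inline document instead of the live page. The SAMPLE markup is made up for illustration; only the class name "txt_fliesstext" comes from the original question.

```python
from lxml import html

# Hypothetical markup standing in for the real page.
SAMPLE = (
    "<html><body>"
    '<div class="txt_fliesstext">first<br/>second<br/>third</div>'
    "<br/>"
    "<p>footer</p>"
    "<br/>"
    "</body></html>"
)

tree = html.fromstring(SAMPLE)

# xpath() always returns a list, so take the first match.
div = tree.xpath('//div[@class = "txt_fliesstext"]')[0]

# <br> elements inside the div (its descendants).
inside = div.xpath(".//br")

# <br> elements that come after the div and share its parent.
after = div.xpath("following-sibling::br")

print(len(inside), len(after))
```

With this sample, both queries find two elements, but they are different ones: the first pair sits inside the div, the second pair sits next to it under body.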
> until there is a new CSS-Class Style is coming up.
Ok, that's different.
>>> for el in div.iterdescendants(): # or div.itersiblings(); iter() would yield div itself first
...     if el.tag == "br":
...         print(el.tail) # text after a <br> is in .tail, or whatever
...     elif el.tag == "span" or el.get("class"):
...         break
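The same loop as a self-contained sketch, again on made-up markup (the span and its class "other" are assumptions, not taken from the real page). It collects the text after each br inside the div and stops at the first element that is a span or carries a class attribute.

```python
from lxml import html

# Hypothetical fragment; the real article markup may differ.
SAMPLE = (
    '<div class="txt_fliesstext">intro<br/>line one<br/>line two'
    '<span class="other">stop here</span><br/>ignored</div>'
)

# For a fragment with a single root element, fromstring() returns that element.
div = html.fromstring(SAMPLE)

lines = []
for el in div.iterdescendants():
    if el.tag == "br":
        # A <br> has no content of its own; the text that follows it is its tail.
        lines.append((el.tail or "").strip())
    elif el.tag == "span" or el.get("class"):
        # First styled element: stop collecting.
        break

print(lines)  # ['line one', 'line two']
```

Note that "intro" is not collected: it is the div's own .text, not the tail of any br.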
Hope it helps.
Stefan