BeautifulSoup

Peter Otten __peter__ at web.de
Wed Jan 13 15:11:05 CET 2010


yamamoto wrote:

> Hi,
> I am new to Python. I'd like to extract "a" tag from a website by
> using "beautifulsoup" module.
> but it doesnt work!
> 
> //sample.py
> 
> from BeautifulSoup import BeautifulSoup as bs
> import urllib
> url="http://www.d-addicts.com/forum/torrents.php"
> doc=urllib.urlopen(url).read()
> soup=bs(doc)
> result=soup.findAll("a")
> for i in result:
>     print i
> 
> 
> Traceback (most recent call last):
>   File "C:\Users\falcon\workspace\p\pyqt\ex1.py", line 8, in <module>
>     soup=bs(doc)
>   File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in
> __init__
>     BeautifulStoneSoup.__init__(self, *args, **kwargs)
>   File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in
> __init__
>     self._feed(isHTML=isHTML)
>   File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in
> _feed
>     self.builder.feed(markup)
>   File "C:\Python26\lib\HTMLParser.py", line 108, in feed
>     self.goahead(0)
>   File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
>     k = self.parse_starttag(i)
>   File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
>     endpos = self.check_for_whole_start_tag(i)
>   File "C:\Python26\lib\HTMLParser.py", line 301, in
> check_for_whole_start_tag
>     self.error("malformed start tag")
>   File "C:\Python26\lib\HTMLParser.py", line 115, in error
>     raise HTMLParseError(message, self.getpos())
> HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36
> 
> any suggestion?

When BeautifulSoup encounters an error that it cannot fix, the first thing 
you need is a better error message, one that shows the offending line and 
column of the document:


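from BeautifulSoup import BeautifulSoup as bs
import urllib
import HTMLParser

url = "http://www.d-addicts.com/forum/torrents.php"
doc = urllib.urlopen(url).read()

# Manual fix, disabled for now (see below):
#doc = doc.replace("\>", "/>")

try:
    soup = bs(doc)
except HTMLParser.HTMLParseError as e:
    # Show the offending source line with a caret under the reported column.
    # e.lineno is 1-based, e.offset is 0-based.
    lines = doc.splitlines(True)
    print lines[e.lineno-1].rstrip()
    print " " * e.offset + "^"
else:
    # Parsing succeeded: print every <a> tag.
    result = soup.findAll("a")
    for i in result:
        print i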
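
If you want to try the line/column bookkeeping without hitting the network, 
here is a minimal sketch of the same idea as a standalone function. The 
name show_error_position and the sample document are made up for 
illustration, and the position (line 2, column 4) is picked by hand rather 
than taken from a real HTMLParseError. It assumes a 1-based line number and 
a 0-based column, as in the except branch above.

```python
def show_error_position(doc, lineno, offset):
    """Return the offending line plus a caret line pointing at the column.

    lineno is assumed to be 1-based and offset 0-based.
    """
    lines = doc.splitlines(True)
    source_line = lines[lineno - 1].rstrip()
    pointer = " " * offset + "^"
    return source_line + "\n" + pointer


# Made-up two-line document; the stray backslash is at line 2, column 4.
sample = '<p>ok</p>\n<br \\><a href="x">x</a>\n'
print(show_error_position(sample, 2, 4))
```

The caret ends up under the backslash of the second line. With tabs in the 
source line the alignment may be off, since each tab counts as one column.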

Once you know the origin of the problem, you can devise a manual fix. Here 
you could uncomment the line

doc = doc.replace("\>", "/>")

Keep in mind, though, that a blanket replacement which fixes this broken 
document may break another (valid) one.
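
To illustrate that caveat, here is a small sketch with two made-up 
snippets (neither is taken from the actual page): one with a tag closed by 
"\>" and one valid snippet whose text happens to contain the same two 
characters. The helper name fix_backslash_tags is hypothetical; it just 
wraps the replace() call from above.

```python
def fix_backslash_tags(doc):
    """Blindly turn every backslash followed by '>' into '/>'."""
    return doc.replace("\\>", "/>")


# Made-up inputs: a broken tag, and valid markup showing a DOS prompt.
broken = '<br \\><a href="x">x</a>'
valid = '<pre>C:\\> dir</pre>'

print(fix_backslash_tags(broken))  # <br /><a href="x">x</a>
print(fix_backslash_tags(valid))   # <pre>C:/> dir</pre>
```

The first snippet is repaired, but the text of the second one is silently 
changed, so it is safer to apply such a fix only to documents you have 
inspected.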

Peter


