Trying to understand html.parser.HTMLParser

Mon May 16 03:26:26 EDT 2011

On 05/16/2011 03:06 AM, David Robinow wrote:
> On Sun, May 15, 2011 at 4:45 PM, Andrew Berg <bahamutzero8825 at gmail.com> wrote:
>> I'm trying to understand why HTMLParser.feed() isn't returning the whole
>> page. My test script is this:
>>
>> import urllib.request
>> import html.parser
>> class MyHTMLParser(html.parser.HTMLParser):
>>     def handle_starttag(self, tag, attrs):
>>         if tag == 'a' and attrs:
>>             print(tag,'-',attrs)
>>
>> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>> page = urllib.request.urlopen(url).read()
>> parser = MyHTMLParser()
>> parser.feed(str(page))
>>
>> I can do print(page) and get the entire HTML source, but
>> parser.feed(str(page)) only spits out the information for the top links
>> and none of the "revisionxxxx" links. Ultimately, I just want to find
>> the name of the first "revisionxxxx" link (right now it's
>> "revision1995", when a new build is uploaded it will be "revision2000"
>> or whatever). I figure this is a relatively simple page; once I
>> understand all of this, I can move on to more complicated pages.
> You've got bad HTML. Look closely and you'll see that there's no space
> between the "revisionxxxx" strings and the style tag following.
> The parser doesn't like this. I don't know a solution other than
> fixing the html.
> (I created a local copy, edited it and it worked.)
Hello,

Use a regular expression for bad HTML, or BeautifulSoup (google it). Below
is an example that extracts all HTML links:

import re
# htmlSource is assumed to be a str holding the page's HTML
linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print(link)
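
For the specific goal in the original question (the name of the first
"revisionxxxx" link), a narrower pattern may be enough. Here is a minimal
Python 3 sketch, assuming the link names always look like "revision"
followed by digits. SAMPLE_HTML and first_revision are hypothetical names,
and the sample markup is made up to mimic the missing space described
above; it is not copied from the real page:

```python
import re

# Hypothetical sample modeled on the page described in this thread;
# the real markup at x264.nl may differ.
SAMPLE_HTML = (
    '<a href="?dir=./64bit">64bit</a>\n'
    '<a href="?dir=./64bit/8bit_depth/revision1995"style="color:blue">revision1995</a>\n'
    '<a href="?dir=./64bit/8bit_depth/revision1992"style="color:blue">revision1992</a>\n'
)

# Assumption: a build name is the word "revision" followed by one or more digits.
REVISION_RE = re.compile(r'revision\d+')


def first_revision(html_source):
    """Return the first 'revisionNNNN' string found in html_source, or None."""
    match = REVISION_RE.search(html_source)
    return match.group(0) if match else None


print(first_revision(SAMPLE_HTML))  # prints: revision1995
```

With the real page, decode the bytes returned by urlopen(url).read() first
(for example page.decode('utf-8', errors='replace'), assuming the page is
UTF-8) rather than calling str(page), which in Python 3 returns the
"b'...'" representation of the bytes object.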

Cheers
Karim