urllib2.urlopen(url) pulling something other than HTML

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Mon Aug 20 22:18:00 CEST 2007


On 20 Aug, 15:44, "dogatemycompu... at gmail.com"
<dogatemycompu... at gmail.com> wrote:

> ----------------------------------------------------------
> f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
> parser = htmllib.HTMLParser(f)
>     parser.feed(html)
>     parser.close()
>     return parser.anchorlist
> ----------------------------------------------------------

The htmllib.HTMLParser class is hard to use. I would replace those
lines with:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []    # filled in as <a href=...> tags are seen

    def handle_starttag(self, tag, attrs):
        # tag names and attribute names arrive already lowercased
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)

parser = MyHTMLParser()
parser.feed(htmltext)
parser.close()    # process any data still buffered
print parser.anchorlist

The anchorlist attribute, which I define myself here, is a list
containing the href values of all the <a> tags found in the page.
See <http://docs.python.org/lib/module-HTMLParser.html>
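
As a quick check of the class above, here is a self-contained sketch run
on a made-up snippet of HTML (the sample string is an assumption, not
taken from any real page). The try/except import is only there so that it
also runs on Python 3, where the module is named html.parser.

```python
try:
    from HTMLParser import HTMLParser      # Python 2, as in this post
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)

# Hypothetical sample: one relative link, one anchor without href,
# and one link written in upper case.
sample = ('<p><a href="/jobs">Jobs</a> <a name="top">no href</a> '
          '<A HREF="http://example.com/">Example</A></p>')

parser = MyHTMLParser()
parser.feed(sample)
parser.close()
links = parser.anchorlist    # ['/jobs', 'http://example.com/']
```

The anchor without an href is skipped, and the upper-case tag is still
found because the parser lowercases tag and attribute names.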

> I get the idea that we're allocating some memory that looks like a
> file so formatter.dumbwriter can manipulate it.  The results are
> passed to formatter.abstractformatter which does something else to the
> HTML code.  The results are then passed to "f" which is then passed to
> htmllib.HTMLParser so it can parse the html for links.   I guess I
> don't understand with any great detail as to why this is happening.
> I know someone is going to say that I should RTFM so here is the gist
> of the documentation:

Don't even try to understand it - it's a mess. Use the HTMLParser
module instead.

> The last question is..   I can't find any documentation to explain
> where the "anchorlist" attribute came from?   Here is the only
> reference to this attribute that I can find anywhere in the Python
> documentation.

And that's all you will find.

> So ..  How does an average developer figure out that parser returns a
> list of hyperlinks in an attribute called anchorlist?  Is this

Usually, those attributes are hyperlinked and you can find them in the
documentation index. Not for this one :(

> something that you just "figure out" or is there some book I should be
> reading that documents all of the attributes for a particular
> method?   It just seems a bit obscure and certainly not something I
> would have figured out on my own.  Does this make me a poor developer
> who should find another hobby?   I just need to know if there is
> something wrong with me or if this is a reasonable question to ask.

It's a very reasonable question. The attribute should be documented
properly. But the class itself is a bit old; I never use it
anymore.

> The last question I have is about debugging.   The spider is capable
> of parsing links until it reaches:
>
> "html = get_page(http://www.google.com/jobs/fortune)" which returns
> the contents of a pdf document, assigns the pdf contents to html which
> is later passed to parser.feed(html) which crashes.

You can check the Content-Type header before parsing. Quoting your
get_page function:

> def get_page(url, log):
>     """Retrieve URL and return comments, log errors."""
>     try:
>         page = urllib2.urlopen(url)
>     except urllib2.URLError:
>         log("Error retrieving: " + url)
>         return ''
>     body = page.read()
>     page.close()
>     return body

From <http://docs.python.org/lib/module-urllib2.html>, urlopen
returns a file-like object with an additional info() method that
returns the response headers. You can get the Content-Type using
page.info().gettype(), which should be text/html (or
application/xhtml+xml for XHTML). For any other type, just return ''
as you do for any error.
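
A minimal sketch of that check, under some assumptions: HTML_TYPES,
content_type() and the opener parameter are hypothetical names introduced
here (opener only exists so the function can be tried without a network
connection). The fallback imports and get_content_type() are the Python 3
counterparts of urllib2 and gettype().

```python
try:
    from urllib2 import urlopen, URLError      # Python 2, as in this post
except ImportError:
    from urllib.request import urlopen         # Python 3 equivalents
    from urllib.error import URLError

# Assumed list of types worth feeding to the HTML parser.
HTML_TYPES = ("text/html", "application/xhtml+xml")

def content_type(page):
    """Return the MIME type of an open response, without parameters."""
    headers = page.info()
    if hasattr(headers, "gettype"):            # Python 2 headers object
        return headers.gettype()
    return headers.get_content_type()          # Python 3 headers object

def get_page(url, log, opener=urlopen):
    """Retrieve URL and return its body; '' on errors or non-HTML content."""
    try:
        page = opener(url)
    except URLError:
        log("Error retrieving: " + url)
        return ''
    try:
        ctype = content_type(page)
        if ctype not in HTML_TYPES:
            # e.g. application/pdf: skip it instead of crashing the parser
            log("Skipping %s (%s)" % (url, ctype))
            return ''
        return page.read()
    finally:
        page.close()
```

The body is only read after the type has been checked, so a large PDF is
never downloaded in full just to be thrown away.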

--
Gabriel Genellina



